The EM procedure is a principal tool for parameter estimation in hidden Markov models.
However, applications often replace EM by Viterbi extraction, or training (VT). VT is computationally
less intensive, more stable and has more of an intuitive appeal, but VT estimation is biased and 
does not satisfy the following fixed point property. Hypothetically, given an infinitely large sample 
and initialized to the true parameters, VT will generally move away from the initial values. We 
propose adjusted Viterbi training (VA), a new method to restore the fixed point property and 
thus alleviate the overall imprecision of the VT estimators, while preserving the computational 
advantages of the baseline VT algorithm. Simulations elsewhere have shown that VA appreciably 
improves the precision of estimation in both the special case of mixture models and more general 
HMMs. However, being entirely analytic, the VA correction relies on infinite Viterbi alignments 
and associated limiting probability distributions. While explicit in the mixture case, the existence 
of these limiting measures is not obvious for more general HMMs. This paper proves that under 
certain mild conditions, the required limiting distributions for general HMMs do exist. 

Keywords: Baum-Welch; bias; computational efficiency; consistency; EM; hidden Markov 
1. Introduction

Hidden Markov models (HMMs) have been called "one of the most successful statistical 
modelling ideas that have [emerged] in the last forty years" [8]. Since their classical 
application to digital communication in the 1960s (see further references in [8]), HMMs have
had a defining impact on the mainstream technologies of speech recognition [18, 19, 20, 
32, 35, 38, 40, 41, 46, 47, 48] and, more recently, bioinformatics [11, 12, 25]. Natural 
language [21, 36], image [30] and more general spatial [17] models are only a few of the 
numerous other applications of HMMs. 

Applications of HMMs inevitably face the problem of parameter estimation. Let us
consider estimation of parameters of a finite-state hidden Markov model (HMM) given
observations $x_{1:n} = x_1, \ldots, x_n$ on $X_{1:\infty} = X_1, X_2, \ldots$, the observable process of the HMM,
up to time $n$. For any real application, $X_i$ can be assumed to take on values in $\mathcal{X} = \mathbb{R}^D$
for some suitable $D$. Let $Y_{1:\infty} = Y_1, Y_2, \ldots$, the hidden layer of the HMM, be a
(time-homogeneous) Markov chain (MC) with state space $S = \{1, \ldots, K\}$, transition matrix
$P = (p_{ij})$ and initial distribution $\pi = \pi P$. To every state $l \in S$, there corresponds an
emission distribution $P_l(\theta_l)$ with density $f_l$ that is known up to the parametrization
$f_l(x; \theta_l)$, $\theta_l \in \Theta_l$, where $\Theta_l$ are rather general domains in $\mathbb{R}^d$. When $Y_k$, $k \ge 1$, is in state
$l$, an observation $x_k$ on $X_k$ is emitted according to $P_l(\theta_l)$ and independent of everything
else. The $Y_{1:\infty}$ process is also called a regime [31]. The maximum likelihood (ML)
approach has become standard for estimation of $\psi = (P, \theta)$, the HMM parameters, where
$\theta = (\theta_1, \theta_2, \ldots, \theta_K)$. In part, this has been due to the well-known theoretical properties
of (local) consistency and asymptotic normality generally enjoyed by the ML estimators
(MLE). Perhaps a more significant reason for the widespread use of the ML approach
has been the availability of the EM algorithm with its computationally efficient
implementation known as the Baum-Welch or simply Baum, or forward-backward algorithm
[1, 2, 8, 14, 20, 39, 40].

Since EM can, in practice, be slow or computationally expensive, it is commonly re- 
placed by Viterbi extraction, or training (VT), also known as the Baum-Viterbi algo- 
rithm. VT appears to have been introduced in [19] by F. Jelinek and his colleagues at 
IBM in the context of speech recognition, in which it has been used extensively ever 
since [14, 18, 32, 35, 40, 41, 46, 47, 48]. Its computational stability (i.e., rapid exit) and 
intuitive appeal [14] have also made VT popular in natural language modeling [36], im- 
age analysis [30] and bioinformatics [4, 11, 13, 25, 37]. VT is also related to constrained 
vector quantization [10]. The main idea of the method is to replace the computationally 
costly expectation (E-step) of the EM algorithm with an appropriate maximization step 
that generally requires less intensive computer operations (otherwise, the two algorithms
scale as $K^2 n$). In speech recognition, essentially the same training procedure was also
described by L. Rabiner et al. in [22, 41] (see also [39, 40]) as a variation of the Lloyd
algorithm used in vector quantization. In that context, VT has gained the name segmental
K-means [14, 22]. The analogy with vector quantization is especially pronounced when 
the underlying chain is trivialized to i.i.d. variables, thus producing an i.i.d. sample from 
a mixture distribution. For such mixture models, VT was also described by R. Gray et al. 
in [10], where the training algorithm was considered in the vector quantization context 
under the name entropy constrained vector quantization (ECVQ). A better-known name 
for VT in the mixture case is classification EM (CEM) [9, 15], stressing that instead of 
the mixture likelihood, CEM maximizes the classification likelihood [4, 9, 15, 33]. VT- 
CEM was also particularly suitable for the early efforts in image segmentation [44, 45]. 
Also, for the uniform mixture of Gaussians with a common covariance matrix of the form
$\sigma^2 I$ (where $I$ is the $K \times K$ identity matrix) and unknown $\sigma$, VT, or CEM, is equivalent
to the k-means clustering [9, 10, 15, 43].

1.1. VT estimation and relevance of VA to real applications 

The VT algorithm for estimation of $\psi$ can be described as follows. Start with some
initial values $\psi^{(0)} = (P^{(0)}, \theta^{(0)})$ and (use the Viterbi algorithm to) find a realization of






J. Lember and A. Koloydenko 



$Y_{1:n}$ that maximizes the likelihood of the given observations. Any such $n$-tuple of states
is called a Viterbi, or forced, alignment. An alignment partitions the original sample $x_{1:n}$
into subsamples corresponding to distinct states. If regarded as an i.i.d. sample from
$P_l(\theta_l)$, the subsample corresponding to state $l$ gives rise to $\hat\mu_l^n$, the maximum likelihood
estimate (MLE) of $\theta_l$, $l \in S$. At step $m + 1$, these estimates replace $\theta^{(m)}$. The transition
probabilities are similarly estimated (by MLE) from the current alignment. The updated
parameters $\psi^{(m+1)}$ are subsequently used to obtain a new alignment, and so on. It can
be shown that, in general, $\psi^{(m)}$ converges (to some $\psi^*(x_{1:n}, \psi^{(0)})$) in finitely many steps
$m$ [22]; also, it is usually much faster than the Baum algorithm. Note that when each $f_l$
is modelled as a mixture, which is common in audio and visual processing, VT can be
applied at both stages of this model - first, in its general form (i.e., as with $f_l$ general) and
then in its CEM form to learn each individual $f_l$. Alternatively, the original HMM can,
from the very beginning, be replaced by the equivalent one with hidden states $(l, s(l))$,
where $s(l)$ indicates the (sub)component of $f_l$. VT can then also be applied to this new
HMM as, for example, has been done in the Philips speech recognition system [35].

Despite its attractiveness, VT can be challenged, as its estimators are generally biased 
and not consistent. This has been noted, at least in the case of mixtures, since [4], with 
a specific caveat issued in [49]. Simulations in [27] and [24] illustrate appreciable biases 
in VT estimation in the i.i.d. and more general HMM settings, respectively. At the same 
time, these facts are not surprising. Indeed, unlike EM, which increases the likelihood of
$\psi$ given $x_{1:n}$, VT increases the joint likelihood of the (hidden) state sequence $y_{1:n}$ and
the parameters $\psi$, given $x_{1:n}$. According to [34], under certain conditions, the difference
between the two objective functions vanishes as $D$, the dimension of the emission $X_i$,
grows sufficiently large relative to $\log(K)$, which can be realistic in isolated word
recognition [34]. However, as later clarified in [14], this does not imply closeness of the parameter
estimates obtained by EM and VT (unless the algorithms are initialized identically) since
both perform a local, rather than global, optimization.
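
The two objective functions contrasted above can be computed side by side. The following is a minimal sketch, not code from the paper: it assumes unit-variance Gaussian emissions and 0-based state labels, and the names `hmm_scores`, `logsumexp` and `gauss_logpdf` are hypothetical. The forward recursion sums over predecessor states (the likelihood targeted by EM), while the Viterbi recursion takes the maximum (the joint likelihood targeted by VT).

```python
import math

def logsumexp(vals):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(vals)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in vals))

def hmm_scores(log_pi, log_P, log_emis):
    """Return (log-likelihood of x_{1:n}, max over y_{1:n} of the joint log-likelihood).

    log_emis[t][l] is log f_l(x_t); states are labelled 0..K-1.
    """
    K = len(log_pi)
    alpha = [log_pi[l] + log_emis[0][l] for l in range(K)]  # sum recursion
    delta = list(alpha)                                     # max recursion
    for t in range(1, len(log_emis)):
        alpha = [logsumexp([alpha[i] + log_P[i][j] for i in range(K)]) + log_emis[t][j]
                 for j in range(K)]
        delta = [max(delta[i] + log_P[i][j] for i in range(K)) + log_emis[t][j]
                 for j in range(K)]
    return logsumexp(alpha), max(delta)

def gauss_logpdf(x, mean):
    """Log-density of N(mean, 1); the unit variance is an assumption of this sketch."""
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2.0 * math.pi)

xs = [-1.2, 0.3, 0.8, 1.1]
means = [-1.0, 1.0]
log_pi = [math.log(0.5)] * 2
log_P = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
log_emis = [[gauss_logpdf(x, m) for m in means] for x in xs]
loglik, best_joint = hmm_scores(log_pi, log_P, log_emis)
print(loglik, best_joint)
```

Since the likelihood is a sum over all $K^n$ paths and the Viterbi score is the largest term of that sum, the second value never exceeds the first.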

Certainly, unbiasedness and consistency are neither necessary nor sufficient for a
procedure to perform well in applications [45]. However, there are indications that some
applications, such as segment-based speech recognition [46], do prefer the standard, that
is, EM-type, likelihood maximization. Also, [46] notes that conventional speech
recognizers would prefer the 'smoother convergence' of $\psi^{(m)}$ under EM, presumably over the
more abrupt, greedy convergence of $\psi^{(m)}$ under VT. At the same time, it appears that
in complex environments, VT can be appreciably simpler to implement than EM [46]. 
Hence, it appears to be sensible to combine the simplicity of VT's implementation with 
the desirable properties of EM. 

Indeed, there are variations of VT that use more than one best alignment or several 
perturbations of the best alignment [36]. VA, our type of adjusted VT, is of a different 
nature as it improves the estimation precision by means of analytic calculations and 
does not compute more than one optimal alignment per iteration. Moreover, we suggest 
that investigating such alternatives to VT and EM for real applications is nowadays 
much more appealing than ever before, thanks to the abundance of virtually infinite 
and freely available streams of audio and video (e.g., real-time broadcasting) as well as 
biological data. Actually, practitioners have already realized this by shifting from entirely 
supervised to semi- and unsupervised modes of training [50]. One naive realization of
these ideas is to simply use the estimates obtained from a labeled sample (i.e., with $y_{1:n}$
known) as the initial guess $\psi^{(0)}$ for a further unsupervised retraining. A more dedicated
application would be model adaptation, wherein the model $\psi^{(0)}$ (initially trained in any
mode) may need to be adapted to a new environment (e.g., speaker) differing from the 
original one mostly, or only, by the emission parameters. Applicability of our adjusted 
VT for mixture models and situations when the transition probabilities are either known 
or nuisance is further discussed in Section 2.3. Finally, simulations in [27] and [24] clearly 
show that VA, our method of adjusting VT, does significantly improve the precision of 
VT estimation. In those experiments, the VA estimates are always comparable to the EM 
estimates, while the VA algorithm is only marginally more intensive than the baseline 
VT algorithm. 

1.2. The adjusted Viterbi training and contribution of this work 

Is it possible to adjust VT in an analytic way in order to enjoy both the desirable
properties of VT (fast convergence of $\psi^{(m)}$, overall computational feasibility, simplicity of
implementation and an overall intuitive appeal) and more consistent estimation?
Ensuring that an algorithm has the true parameters as its asymptotically fixed point turns
out to be pivotal in constructing such adjusted estimators. Evidently, this fixed point
property holds for EM, but not for VT. Namely, for a sufficiently large sample, the EM
algorithm 'recognizes' and 'confirms' the true parameters. In contrast to this, an
iteration of VT generally disturbs the correct values noticeably. In [27], we have proposed to
modify VT in order to make the true parameters an asymptotically fixed point of VA, 
the resulting algorithm. 

In order to understand VA, it is crucial to understand the asymptotic behaviors of $\hat\mu_l^n$
and $\hat p_{ij}^n$, the maximum likelihood estimators based on the Viterbi alignment. Since the
alignment depends on $\psi^{(0)}$, the initial values of the parameters (and on the tie-breaking
rule, which is ignored for the time being), so do $\hat\mu_l^n(\psi^{(0)}, X_{1:n})$ and $\hat p_{ij}^n(\psi^{(0)}, X_{1:n})$. Note
that, for $\psi$ to be asymptotically fixed by an estimation algorithm, it means that if
$\psi = (P, \theta)$ are the true parameters and are used to compute the alignment, then

$$\hat\mu_l^n(\psi, X_{1:n}) \xrightarrow[n\to\infty]{} \theta_l \ \text{a.s.} \ \forall l \in S; \qquad \hat p_{ij}^n(\psi, X_{1:n}) \xrightarrow[n\to\infty]{} p_{ij} \ \text{a.s.} \ \forall (i,j) \in S^2. \tag{1.1}$$

The reason why VT does not enjoy the desired fixed point property is that (1.1) need
not hold in general [4, 49]. Hence, in order to restore the above fixed point property in
VT, we need to verify that the sequences in (1.1) converge almost surely and, provided
they do, exhibit their limits. This paper essentially accomplishes these tasks. Namely,
we show that (under certain mild conditions) the empirical measures $\hat P_l^n(\psi, X_{1:n})$
obtained via the Viterbi alignment do converge weakly to a certain limiting probability
measure $Q_l(\psi)$ (2.5) and that, in general, $Q_l(\psi) \ne P_l(\theta_l)$. In [24], we have shown that
under general conditions on the densities $f_l(x; \theta_l)$ (and, for $\Theta_l$, closed subsets of $\mathbb{R}^d$), the
above convergence $\hat P_l^n(\psi, X_{1:n}) \Rightarrow Q_l(\psi)$ a.s. as $n \to \infty$ (properly introduced in (2.5)) implies
convergence of $\hat\mu_l^n$, that is,

$$\hat\mu_l^n(\psi, X_{1:n}) \xrightarrow[n\to\infty]{} \mu_l(\psi) \ \text{a.s., where} \ \mu_l(\psi) \overset{\mathrm{def}}{=} \arg\max_{\theta_l' \in \Theta_l} \int \ln f_l(x; \theta_l')\, Q_l(dx; \psi). \tag{1.2}$$

Since, in general, $Q_l(\psi) \ne P_l(\theta_l)$, clearly $\mu_l(\psi)$ need not equal
$\arg\max_{\theta_l'} \int \ln f_l(x; \theta_l')\, P_l(dx; \theta_l)$.

In order to obtain the above results, in Section 4, we extend Viterbi alignments, or
paths, ad infinitum. Namely, considering (finite) Viterbi alignments with tie-breaking
rules of a special kind, we prove the existence of a decoding $v : \mathcal{X}^\infty \to S^\infty$ such that,
for almost every realization $x_{1:\infty}$, the following property holds: for every $m \in \mathbb{N}$, there
exists an $n = n(x_{1:\infty}, m) \in \mathbb{N}$, $n \ge m$, such that the codeword $v(x_{1:\infty})$ and the Viterbi
alignment based on $x_{1:n}$ agree up to time $m$. To emphasize the dependence of $v$ on $\psi$,
we will write $v(x_{1:\infty}; \psi)$. It can then also be shown that when $\psi$ are the true parameters,
the process $V \overset{\mathrm{def}}{=} v(X_{1:\infty}; \psi)$ is regenerative. In particular, for any $i, j \in S$, there exists
$q_{ij}(\psi) \ge 0$ such that $\sum_j q_{ij}(\psi) = 1$ for every $i \in S$ and

$$\hat p_{ij}^n(\psi, X_{1:n}) \xrightarrow[n\to\infty]{} q_{ij}(\psi) \quad \text{a.s.} \tag{1.3}$$

Again, in general, $p_{ij} \ne q_{ij}(\psi)$. Reduction of the biases $\mu_l(\psi) - \theta_l$ and $q_{ij}(\psi) - p_{ij}$ is the
main feature of the adjusted Viterbi training.



1.3. Previous related work 

We are not aware of any systematic treatment of asymptotic reduction of the bias in 
VT estimation (without compromising the advantages of the VT algorithm over Baum- 
Welch) preceding [27]. In [23], however, a sequential version of VT ('the segmental K- 
means algorithm') is suggested, which can allegedly reduce the estimation bias asymptot- 
ically. The suggested modification appears substantially different from our adjustment, 
although we have been unable to evaluate the algorithm of [23] thoroughly due to the 
lack of detail in its description in [23] or anywhere else to date. 

Moreover, to the best of our knowledge, there has been no systematic study of asymptotic
properties of the Viterbi alignments to date besides certain attempts made by Kogan
in [23] in the context of the sequential version of VT (see above) and, more recently, by
Caliebe and Rösler in [7] and Caliebe in [5]. Both groups have given thorough treatments
of certain special cases, mostly $K = 2$, but this, as we explain below, is too special.

Importantly, it was recognized in [23] that under certain conditions, longer Viterbi 
alignments can be obtained piecewise. Roughly, the end-points of the pieces and the 
(random) times of their occurrence were termed 'special columns' and 'most informative 
stopping times', respectively. In [5, 7], related notions of 'meeting states' and 'meeting 
times' are used. Independently of [5, 7, 23], we have built our theory on the notion of 
nodes (roughly, observations emitted from the 'special columns'; see Section 3.1) and the 
stopping times of their occurrence. If defined to be independent of a particular global
tie-breaking rule, the meeting times of [5] would correspond to 'strong nodes' of order 
0, a particular type of nodes. More importantly, even our (general) nodes, which are 
essentially equivalent to the special columns of [23] and 'path crossings' of [5, 7], are not 
sufficiently general in the sense that HMMs with aperiodic and irreducible Markov chains 
need not necessarily have special columns, or nodes, infinitely often almost surely, despite 
the claim to the contrary made in Theorem 2 of [23] (stated without proof and implicitly 
cited in [14]). For a counterexample, we refer to Example 3.11 in [26], a downloadable 
preprint of this paper. Appropriate sufficient conditions to guarantee the desired property 
have also been given in [26] for the first time. Implicitly, the alignment process in [23] 
was recognized as regenerative with respect to the 'most informative stopping times'. 
The limiting alignment process of [5] is already explicitly shown to be regenerative. 
Regenerativity with respect to (the times of) nodes is also essential for our purpose of
exhibiting the limiting measures $Q_l(\psi)$ (2.5) and $q_{ij}(\psi)$ (1.3).

Convergence of the Viterbi paths was, to the best of our knowledge, first seriously 
considered in [5, 7], where the existence of infinite alignments for certain special cases, 
such as K = 2 and some HMMs with additive white Gaussian noise, was also proven. 
While innovative, the main result of [7] (Theorem 2) makes several restrictive assumptions 
preventing its extension beyond the K = 2 case. As its by-product, this work extends 
some, and corrects other, results of [5, 7]. This is explained in detail in the appropriate 
paragraphs of Sections 3.1-3.3 and Section 4. Also, note that our goal of exhibiting
$Q_l(\psi)$ and $q_{ij}(\psi)$ extends beyond solely defining infinite Viterbi alignments (the main
goal of [7]).

1.4. Organization of the rest of the paper 

First, in Section 2, we properly introduce the baseline and adjusted Viterbi training
procedures (Section 2.2) for HMMs. In Section 2.3, the adjusted Viterbi training is discussed
in the context of the following two important variations on the main situation: the regime
parameters are known or nuisance. More general issues of implementation are discussed
in Section 2.4. Sections 2.3 and 2.4 can be skipped without interruption of the main 
presentation. 

Recall that our ultimate goal has been asymptotic reduction of the bias in VT
estimation for as general a class of HMMs as possible. The main goal of this paper, however,
is to prove the existence of the limiting measures $Q_l(\psi)$ (2.5) and $q_{ij}(\psi)$ (1.3) that
underpin our approach to achieving the ultimate goal. A significant effort has been made
to achieve this accurately and under as non-restrictive conditions as possible. This is the 
main reason why we cannot directly reuse the tools used by others ([5, 7, 23]). As we reit- 
erate further in Section 3, the asymptotic behavior of the Viterbi alignment is not trivial 
and does require special tools. Thus, nodes and barriers, our main tools, are presented in 
Sections 3.1 and 3.3, respectively. In Section 3.2, we explain our piecewise construction 
of the proper Viterbi alignments. This is still at the level of individual realizations of the 
HMM process. In Section 3.3, barriers, on the other hand, extend our construction for 
almost every realization of the HMM process. This is the essence of Lemmas 3.1 and 3.2,
the first of the two main results of this paper. In Section 4, we define $V_{1:\infty}$, the proper
infinite alignment process. Finally, in the same section we prove the existence of the
measures $Q_l(\psi)$ and $q_{ij}(\psi)$, our second main result, using regenerativity of the augmented
process $(V_{1:\infty}, X_{1:\infty})$ (Theorem 4.1 and Corollary 4.1).

Exhibiting the measures Qi(ip) under very general conditions has necessitated several 
rather technical constructions, mainly used to prove Lemmas 3.1 and 3.2. Due to spatial 
limitations, they are not given here, but rather appear in [26]. 

2. The adjusted Viterbi training 

2.1. The model 

Recall that $Y_{1:\infty}$ takes values in $S = \{1, \ldots, K\}$ and has transition matrix $P$. Let $Y_{1:\infty}$ be
irreducible and aperiodic, hence a unique $\pi = \pi P$ exists. Let the emission distributions
$P_l(\theta_l)$, $l \in S$, be defined on $(\mathcal{X}, \mathcal{B})$, where $\mathcal{X}$ and $\mathcal{B}$ are a separable metric space and the
corresponding Borel $\sigma$-algebra, respectively. Let $f_l$ be the density of $P_l(\theta_l)$ with respect
to a suitable reference measure $\lambda$ on $(\mathcal{X}, \mathcal{B})$.

Definition 2.1. The stochastic process $X$ is a hidden Markov model if there is a
(measurable) function $h$ such that, for each $n$,

$$X_n = h(Y_n, e_n), \quad \text{where } e_1, e_2, \ldots \text{ are i.i.d. and independent of } Y. \tag{2.1}$$

Hence, the emission distribution $P_l(\theta_l)$ is the distribution of $h(l, e_n)$. The distribution
of $X$ is completely determined by the regime parameters $P$ and the emission distributions
$P_l(\theta_l)$, $l \in S$. The process $X$ is also $\alpha$-mixing and, therefore, ergodic [14, 16, 29].
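
Definition 2.1 translates directly into a sampler. The sketch below is illustrative only and rests on assumptions that are not in the text: states are labelled $0, \ldots, K-1$ rather than $1, \ldots, K$, the noise $e_n$ is standard Gaussian, and $h(l, e) = m_l + e$ for hypothetical means $m_l$. The function name `sample_hmm` is likewise hypothetical.

```python
import random

def sample_hmm(n, pi, P, h, noise, seed=0):
    """Draw (y_{1:n}, x_{1:n}) with x_t = h(y_t, e_t), as in (2.1).

    pi: initial distribution, P: transition matrix (list of rows),
    h: function of (state, noise), noise: function drawing e_t from an RNG.
    """
    rng = random.Random(seed)
    states = list(range(len(pi)))
    y = rng.choices(states, weights=pi)[0]        # Y_1 ~ pi
    ys, xs = [], []
    for _ in range(n):
        ys.append(y)
        xs.append(h(y, noise(rng)))               # emission depends on Y_t and e_t only
        y = rng.choices(states, weights=P[y])[0]  # next state from row P[y]
    return ys, xs

means = [-1.0, 1.0]  # hypothetical emission means
ys, xs = sample_hmm(
    1000,
    pi=[0.5, 0.5],
    P=[[0.9, 0.1], [0.1, 0.9]],
    h=lambda l, e: means[l] + e,
    noise=lambda rng: rng.gauss(0.0, 1.0),
)
```

Here $\pi = (0.5, 0.5)$ is the stationary distribution of the symmetric matrix used, so the regime in this example is stationary, in line with the assumption $\pi = \pi P$.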

2.2. Viterbi alignment and training 

Let

$$\Lambda(y_{1:n}; x_{1:n}, \psi) = \mathbf{P}(Y_{1:n} = y_{1:n}) \prod_{i=1}^{n} f_{y_i}(x_i; \theta_{y_i}), \quad \text{where } \mathbf{P}(Y_{1:n} = y_{1:n}) = \pi_{y_1} \prod_{i=2}^{n} p_{y_{i-1} y_i},$$

be the likelihood functions of the $y_{1:n}$, treated as parameters. Given $x_{1:n}$, let $\mathcal{V}(x_{1:n}; \psi)$
be the set of all maximum-likelihood estimates of $y_{1:n}$. These estimates, or paths, are
efficiently obtained by the Viterbi algorithm and are called the Viterbi alignments.
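
For concreteness, a maximizer of $\Lambda$ can be found by the standard Viterbi dynamic program in the log domain. The sketch below is a generic textbook version, not the specially constructed alignment of Section 3.2: it breaks ties by always taking the lowest state index, which is only one possible (assumed) tie-breaking rule, and it labels states $0, \ldots, K-1$. The function name `viterbi` and the toy inputs are hypothetical.

```python
import math

def viterbi(log_pi, log_P, log_emis):
    """Return (path, score): one maximizer of the joint log-likelihood and its value.

    log_emis[t][l] is log f_l(x_t). Ties are resolved toward the lowest state index.
    """
    n, K = len(log_emis), len(log_pi)
    delta = [log_pi[l] + log_emis[0][l] for l in range(K)]
    back = []  # back[t-1][j] = best predecessor of state j at time t
    for t in range(1, n):
        prev, delta, ptr = delta, [], []
        for j in range(K):
            # max() returns the first maximizer, so ties go to the lowest index.
            i_best = max(range(K), key=lambda i: prev[i] + log_P[i][j])
            ptr.append(i_best)
            delta.append(prev[i_best] + log_P[i_best][j] + log_emis[t][j])
        back.append(ptr)
    last = max(range(K), key=lambda l: delta[l])
    path = [last]
    for ptr in reversed(back):  # trace the back-pointers from time n to time 1
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]

# Toy example: sticky two-state chain, unit-variance Gaussian emissions (assumed).
xs = [-1.2, 0.3, 0.8, 1.1]
means = [-1.0, 1.0]
log_pi = [math.log(0.5)] * 2
log_P = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
log_emis = [[-0.5 * (x - m) ** 2 - 0.5 * math.log(2.0 * math.pi) for m in means] for x in xs]
path, score = viterbi(log_pi, log_P, log_emis)
print(path)
```

The recursion visits every pair of states at every time step, which gives the $K^2 n$ scaling mentioned in the Introduction.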

The non-uniqueness of the alignments causes substantial technical inconveniences. In
Section 3.2, we specify unique $v(x_{1:n}; \psi) \in \mathcal{V}(x_{1:n}; \psi)$ for every $n \in \mathbb{N}$ and $x_{1:n} \in \mathcal{X}^n$
(and every $\psi$) in a consistent manner that is suitable to prove the existence of $Q_l(\psi)$.
Meanwhile, the uniqueness of $v(x_{1:n}; \psi)$ is an assumption. VT estimation of $\psi$ is defined
formally as follows (where $I_A$ is the indicator function of set $A$):
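(1) choose initial values for the parameters $\psi^{(k)} = (P^{(k)}, \theta^{(k)})$, $k = 0$;

(2) given $\psi^{(k)}$, current parameters, obtain the alignment $v^{(k)} = v(x_{1:n}; \psi^{(k)})$;

(3) update the regime parameters $P^{(k+1)} \overset{\mathrm{def}}{=} (\hat p_{ij}^n)$ as given by

$$\hat p_{ij}^n \overset{\mathrm{def}}{=} \begin{cases} \dfrac{\sum_{m=1}^{n-1} I_{\{i\}}(v_m^{(k)})\, I_{\{j\}}(v_{m+1}^{(k)})}{\sum_{m=1}^{n-1} I_{\{i\}}(v_m^{(k)})}, & \text{if } \sum_{m=1}^{n-1} I_{\{i\}}(v_m^{(k)}) > 0, \\[2ex] p_{ij}^{(k)}, & \text{otherwise}, \end{cases} \qquad i, j \in S; \tag{2.2}$$

(4) assign $x_m$, $m = 1, 2, \ldots, n$, to class $v_m^{(k)}$ and, equivalently, define empirical measures

$$\hat P_l^n(A; \psi^{(k)}, x_{1:n}) \overset{\mathrm{def}}{=} \frac{\sum_{m=1}^{n} I_{A \times \{l\}}(x_m, v_m^{(k)})}{\sum_{m=1}^{n} I_{\{l\}}(v_m^{(k)})}, \qquad A \in \mathcal{B},\ l \in S; \tag{2.3}$$

(5) for each class $l \in S$, obtain $\hat\mu_l^n(\psi^{(k)}, x_{1:n})$, MLE of $\theta_l$, given by

$$\hat\mu_l^n(\psi, x_{1:n}) \overset{\mathrm{def}}{=} \arg\max_{\theta_l' \in \Theta_l} \int \ln f_l(x; \theta_l')\, \hat P_l^n(dx; \psi, x_{1:n}) \tag{2.4}$$

and for all $l \in S$, let

$$\theta_l^{(k+1)} \overset{\mathrm{def}}{=} \begin{cases} \hat\mu_l^n(\psi^{(k)}, x_{1:n}), & \text{if } \sum_{m=1}^{n} I_{\{l\}}(v_m^{(k)}) > 0, \\[1ex] \theta_l^{(k)}, & \text{otherwise}. \end{cases}$$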
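
Steps (3)-(5) above can be sketched as a single update, given an alignment from step (2). This is a minimal illustration under an assumption not made in the text: emissions are one-dimensional Gaussians with known unit variance, so that the class-wise MLE in (2.4) reduces to the class average. States are labelled $0, \ldots, K-1$, and the name `vt_update` is hypothetical.

```python
def vt_update(x, v, P_old, theta_old):
    """One VT parameter update from sample x and alignment v (same length).

    Returns (P_new, theta_new); rows and classes that never occur in v keep
    their previous values, as in the 'otherwise' branches of steps (3) and (5).
    """
    K = len(theta_old)
    # Step (3): empirical transition frequencies along the alignment, cf. (2.2).
    counts = [[0] * K for _ in range(K)]
    for a, b in zip(v, v[1:]):
        counts[a][b] += 1
    P_new = []
    for i in range(K):
        total = sum(counts[i])
        P_new.append([c / total for c in counts[i]] if total > 0 else list(P_old[i]))
    # Steps (4)-(5): split the sample by class; for a unit-variance Gaussian
    # (an assumption of this sketch) the MLE of the mean is the class average.
    sums, sizes = [0.0] * K, [0] * K
    for xm, l in zip(x, v):
        sums[l] += xm
        sizes[l] += 1
    theta_new = [sums[l] / sizes[l] if sizes[l] > 0 else theta_old[l] for l in range(K)]
    return P_new, theta_new

P0 = [[1 / 3] * 3 for _ in range(3)]
P1, theta1 = vt_update([-1.0, -3.0, 2.0, 4.0], [0, 0, 1, 1], P0, [0.0, 0.0, 9.0])
print(P1, theta1)
```

In the example, state 2 never appears in the alignment, so its transition row and emission parameter are carried over unchanged.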



To better interpret VT, suppose that, at some step $k$, $\psi^{(k)} = \psi$, thus $v^{(k)}$ is obtained
using the true parameters. Let $y_{1:n}$ be the actual hidden realization of $Y_{1:n}$. The training
'pretends' that the alignment $v^{(k)}$ is perfect, that is, that $v^{(k)} = y_{1:n}$. If the alignment were
indeed perfect, then the empirical measures $\hat P_l^n$, $l \in S$, would be obtained from the i.i.d.
samples generated from $P_l(\theta_l)$ and the MLE $\hat\mu_l^n(\psi, X_{1:n})$ would be natural estimators to
use. Under these assumptions, $\hat P_l^n(\psi, X_{1:n}) \Rightarrow P_l(\theta_l)$ as $n \to \infty$ a.s. and, provided that
$\{f_l(\cdot; \theta_l) : \theta_l \in \Theta_l\}$ is a $P_l$-Glivenko-Cantelli class and $\Theta_l$ is equipped with a suitable
metric, we would have $\lim_{n\to\infty} \hat\mu_l^n(\psi, X_{1:n}) = \theta_l$ a.s. Hence, if $n$ is sufficiently large, then
$\hat P_l^n(\psi, X_{1:n}) \approx P_l(\theta_l)$ and $\theta_l^{(k+1)} = \hat\mu_l^n(\psi, x_{1:n}) \approx \theta_l = \theta_l^{(k)}$ for every $l \in S$. Similarly, if the
alignment is perfect, then $\lim_{n\to\infty} \hat p_{ij}^n(\psi, X_{1:n}) = \mathbf{P}(Y_2 = j \mid Y_1 = i) = p_{ij}$, a.s. Thus, for the
perfect alignment, $\psi^{(k+1)} = (P^{(k+1)}, \theta^{(k+1)}) \approx (P^{(k)}, \theta^{(k)}) = \psi^{(k)} = \psi$, that is, $\psi$ would be
(approximately) a fixed point of the training algorithm. Certainly, the alignment, in
general, is not perfect, even when it is computed with the true parameters. In particular,
the empirical measures $\hat P_l^n(\psi, X_{1:n})$ can be rather far from those based on i.i.d. samples
from $P_l(\theta_l)$. Hence, we have no reason to expect that $\lim_{n\to\infty} \hat\mu_l^n(\psi, X_{1:n}) = \theta_l$ a.s. and
$\lim_{n\to\infty} \hat p_{ij}^n(\psi, X_{1:n}) = p_{ij}$ a.s. Moreover, we do not even know whether the sequences
of empirical measures $\hat P_l^n(\psi, X_{1:n})$, or MLE estimators $\hat\mu_l^n(\psi, X_{1:n})$ and $\hat p_{ij}^n(\psi, X_{1:n})$,
converge almost surely at all.
As stated in Theorem 4.1, under certain mild conditions, there exist probability
measures $Q_l(\psi)$, $l \in S$, such that

$$\hat P_l^n(\psi, X_{1:n}) \underset{n\to\infty}{\Rightarrow} Q_l(\psi) \quad \text{a.s.} \tag{2.5}$$

From the proof of Theorem 4.1, it also follows (Corollary 4.1) that for every $i \in S$,
there exist probabilities $q_{i1}, \ldots, q_{iK}$ such that (1.3) holds. In general, $\mu_l(\psi) \ne \theta_l$ and
$q_{ij}(\psi) \ne p_{ij}$. In order to reduce the biases $\theta_l - \mu_l(\psi)$ and $p_{ij} - q_{ij}(\psi)$, we have proposed
the adjusted Viterbi training. Namely, suppose that (1.2) and (1.3) hold and consider the
mappings

$$\psi \mapsto \mu_l(\psi), \qquad \psi \mapsto q_{ij}(\psi), \qquad l, i, j = 1, \ldots, K. \tag{2.6}$$

The functions in (2.6) do not depend on $x_{1:n}$, hence the following corrections are well
defined:

$$\Delta_l(\psi) \overset{\mathrm{def}}{=} \theta_l - \mu_l(\psi), \qquad R_{ij}(\psi) \overset{\mathrm{def}}{=} p_{ij} - q_{ij}(\psi), \qquad l, i, j = 1, \ldots, K. \tag{2.7}$$
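Based on (2.7), the adjusted Viterbi training replaces VT steps (3) and (5) as given
below:

(3) for every $i, j \in S$, update the matrix $P^{(k+1)} = (p_{ij}^{(k+1)})$ according to

$$p_{ij}^{(k+1)} \overset{\mathrm{def}}{=} \hat p_{ij}^n + R_{ij}(\psi^{(k)}); \tag{2.8}$$

(5) for all $l \in S$, let

$$\theta_l^{(k+1)} \overset{\mathrm{def}}{=} \begin{cases} \hat\mu_l^n(\psi^{(k)}, x_{1:n}) + \Delta_l(\psi^{(k)}), & \text{if } \sum_{m=1}^{n} I_{\{l\}}(v_m^{(k)}) > 0, \\[1ex] \theta_l^{(k)}, & \text{otherwise}. \end{cases}$$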

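Provided $n$ is sufficiently large, VA, as desired, has the true parameters $\psi$ as its
(approximately) fixed point. Indeed, suppose that $\psi^{(k)} = \psi$. From (1.2), $\hat\mu_l^n(\psi^{(k)}, x_{1:n}) =
\hat\mu_l^n(\psi, x_{1:n}) \approx \mu_l(\psi) = \mu_l(\psi^{(k)})$ for all $l \in S$. Similarly, from (1.3), $\hat p_{ij}^n(\psi^{(k)}, x_{1:n}) =
\hat p_{ij}^n(\psi, x_{1:n}) \approx q_{ij}(\psi) = q_{ij}(\psi^{(k)})$ for all $i, j \in S$. Thus,

$$\theta_l^{(k+1)} = \hat\mu_l^n(\psi, x_{1:n}) + \Delta_l(\psi) \approx \mu_l(\psi) + \Delta_l(\psi) = \theta_l = \theta_l^{(k)}, \qquad l \in S, \tag{2.9}$$

$$p_{ij}^{(k+1)} = \hat p_{ij}^n(\psi, x_{1:n}) + R_{ij}(\psi) \approx q_{ij}(\psi) + R_{ij}(\psi) = p_{ij} = p_{ij}^{(k)}, \qquad i, j \in S. \tag{2.10}$$

Hence, $\psi^{(k+1)} = (P^{(k+1)}, \theta^{(k+1)}) \approx (P^{(k)}, \theta^{(k)}) = \psi^{(k)}$.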
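
The arithmetic behind (2.9) and (2.10) can be checked mechanically. The sketch below applies the additive corrections of (2.7) to given VT estimates; all numerical values of the limits $q_{ij}(\psi)$ and $\mu_l(\psi)$ are made up for illustration, the name `va_update` is hypothetical, and the empty-class fallback of step (5) is omitted for brevity.

```python
def va_update(p_hat, mu_hat, R, Delta):
    """Adjusted update: add corrections R (matrix) and Delta (vector) to VT estimates."""
    P_new = [[p + r for p, r in zip(p_row, r_row)] for p_row, r_row in zip(p_hat, R)]
    theta_new = [m + d for m, d in zip(mu_hat, Delta)]
    return P_new, theta_new

# True parameters and (made-up) limits of the VT estimators at the true parameters.
p_true = [[0.9, 0.1], [0.2, 0.8]]
theta_true = [-1.0, 1.0]
q_lim = [[0.95, 0.05], [0.1, 0.9]]   # hypothetical q_ij(psi)
mu_lim = [-1.2, 1.2]                 # hypothetical mu_l(psi)

# Corrections as in (2.7).
R = [[p - q for p, q in zip(pr, qr)] for pr, qr in zip(p_true, q_lim)]
Delta = [t - m for t, m in zip(theta_true, mu_lim)]

# If the VT estimates have reached their limits, the adjusted update returns the truth.
P_next, theta_next = va_update(q_lim, mu_lim, R, Delta)
print(P_next, theta_next)
```

Because each row of $R$ sums to zero, the adjusted rows still sum to one; individual entries are not guaranteed to stay in $[0, 1]$ for finite $n$, which a practical implementation would need to handle.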

Example 1 (Mixtures). Let $X_1, X_2, \ldots$ be i.i.d. and follow a mixture distribution with
density $\sum_{l=1}^{K} \pi_l f_l(x; \theta_l)$ and (positive) mixing weights $\pi_l$. Such a sequence is an HMM
with transition probabilities $p_{ij} = \pi_j$ for all $i, j \in S$. In this special case, the alignment
and the measures $Q_l$ are easy to exhibit. Indeed, for any set of parameters $\psi = (\pi, \theta)$,
the alignment $v(x_{1:n}; \psi)$ can be obtained via a generalized Voronoi partition $\mathcal{S}(\psi) =
\{S_1(\psi), \ldots, S_K(\psi)\}$, where

$$S_1(\psi) = \{x \in \mathcal{X} : \pi_1 f_1(x; \theta_1) \ge \pi_j f_j(x; \theta_j)\ \forall j \in S\}, \tag{2.11}$$

$$S_l(\psi) = \{x \in \mathcal{X} : \pi_l f_l(x; \theta_l) \ge \pi_j f_j(x; \theta_j)\ \forall j \in S\} \setminus \bigcup_{k=1}^{l-1} S_k(\psi), \qquad l = 2, \ldots, K. \tag{2.12}$$

Now, the alignment can be defined pointwise as follows: $v(x_{1:n}; \psi) = (v(x_1; \psi), \ldots,
v(x_n; \psi))$, where $v(x; \psi) = \sum_{k=1}^{K} k\, I_{S_k(\psi)}(x)$, which returns $l$ if and only if $x \in S_l(\psi)$.
The convergence (2.5) now follows immediately from the strong law of large numbers.
Indeed, if $\psi$ are the true parameters and if the alignment is obtained based on $\psi$, then
the SLLN immediately gives $\hat P_l^n(\psi) \Rightarrow Q_l(\psi)$ almost surely, with densities $q_l(x; \psi)$ of
$Q_l(\psi)$ $\propto f(x; \psi) I_{S_l(\psi)} = \bigl(\sum_{k=1}^{K} \pi_k f_k(x; \theta_k)\bigr) I_{S_l(\psi)}$, $l = 1, 2, \ldots, K$. Hence, the limit of the
class-conditional MLE $\hat\mu_l^n$ is given by

$$\mu_l(\psi) = \arg\max_{\theta_l' \in \Theta_l} \int_{S_l(\psi)} \ln f_l(x; \theta_l') \Bigl(\sum_{k=1}^{K} \pi_k f_k(x; \theta_k)\Bigr) \lambda(dx), \tag{2.13}$$

which, depending on the model, can differ from $\theta_l$ significantly ([24, 27]). Also, (1.3)
follows easily in this case (see [27] for further details). Namely, note that

$$\hat\pi_l^n(\psi, X_{1:n}) \xrightarrow[n\to\infty]{\text{a.s.}} q_l(\psi) \overset{\mathrm{def}}{=} \sum_{k=1}^{K} \pi_k \int_{S_l(\psi)} f_k(x; \theta_k)\, \lambda(dx). \tag{2.14}$$

Thus, in the special case of mixtures, the adjustments $\Delta_l$ and $R_l$ are relatively easy to
obtain and the adjusted Viterbi training is easy to implement. The simulations in [27]
have largely supported the theory, demonstrating both the computational advantage of
VA over EM and the increased precision of VA relative to VT.
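
As a worked instance of (2.13) and (2.14), consider a case that is an assumption of this sketch rather than an example from the paper: the symmetric mixture $\tfrac12 N(-a, 1) + \tfrac12 N(a, 1)$ with known unit variances. The Voronoi cells are then $\{x \le 0\}$ and $\{x > 0\}$, the limits $q_l(\psi)$ both equal $\tfrac12$, and $\mu_l(\psi)$ is the mean of the mixture restricted to the corresponding half-line, namely $\pm\bigl(a(2\Phi(a) - 1) + 2\varphi(a)\bigr)$. The function names below are hypothetical, and classes are indexed from 0.

```python
import math

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def mixture_limits(a):
    """Limits (q, mu) of the VT estimators for 0.5*N(-a,1) + 0.5*N(a,1).

    Class 0 is the cell {x <= 0}, class 1 is the cell {x > 0}.
    """
    # Integral of x * (mixture density) over {x > 0}, using
    # E[X; X > 0] = m*Phi(m) + phi(m) for X ~ N(m, 1).
    upper_moment = 0.5 * (-a * Phi(-a) + phi(a)) + 0.5 * (a * Phi(a) + phi(a))
    q = [0.5, 0.5]                 # mass of each cell under the mixture, cf. (2.14)
    mu1 = upper_moment / q[1]      # for a Gaussian mean, (2.13) is the cell-wise mean
    return q, [-mu1, mu1]

a = 1.0
theta, pi = [-a, a], [0.5, 0.5]
q, mu = mixture_limits(a)
Delta = [t - m for t, m in zip(theta, mu)]   # corrections for the means, cf. (2.7)
R = [w - ql for w, ql in zip(pi, q)]         # corrections for the weights
print(mu, Delta, R)
```

For $a = 1$ the limit of the class-1 mean is about 1.1666 rather than 1, so VT overstates the separation of the components by roughly 17% in this setting, while the weight corrections vanish by symmetry.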



2.3. VA for 'independent training' 

Some applications, such as large vocabulary speech recognition systems [35], fix the 
regime parameters exogenously. With the appropriate simplifications, the baseline and
adjusted Viterbi training procedures, as well as the EM algorithm, immediately apply in 
such situations. In fact, in [24, 27], VA is discussed primarily in this simplified context. 
It can then be argued that, when the regime parameters are known, VA is unnecessary 
as MLI, the maximum likelihood estimation under the independence assumption (which 
can also be called independent training), applies. Let us discuss this issue in more detail. 
According to [31], MLI estimates the emission parameters (and possibly $\pi$ when $P$ is
unknown and not of interest) of general (ergodic) HMMs pretending that $Y_1, Y_2, \ldots$ are
independent, that is, the entire HMM follows a mixture model. This is appealing since
the marginal distribution of the emissions of any HMM (with a stationary regime) is
indeed the mixture with density $\sum_k \pi_k f_k(x; \theta_k)$. Thus, MLI is an instance of the maximum
pseudo-likelihood (MPL) based on the above mixture approximation. The MLI-MPL
estimators for the emission parameters are (locally) consistent [31, 42] and can also be
delivered by EM (for mixtures). Similarly to the general case, when computational
resources do matter, VT (for mixtures) can also be used instead of EM in this case. As in
the general case, Baum-Welch and VT scale identically, but their common computational
complexity is now $Kn$, as opposed to $K^2 n$. The comparative computational performances
of Baum-Welch and VT for mixtures and in the general case are also similar (the Baum
algorithm involves more intensive operations). At the same time, as Example 1 in
Section 2.2 above shows, the VT estimators are still not consistent and, in particular, the
correction $\Delta_l = \theta_l - \mu_l(\psi)$, with $\mu_l(\psi)$ as in (2.13), can be significant.
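
The mixture approximation underlying MLI can be made concrete as follows. This is a minimal sketch under assumptions not fixed by the text: the stationary distribution is found by plain power iteration (adequate for an irreducible aperiodic chain), emissions are unit-variance Gaussians with hypothetical means, and the function names are illustrative.

```python
import math

def stationary(P, iters=1000):
    """Approximate pi = pi P by repeatedly multiplying a uniform start by P."""
    K = len(P)
    pi = [1.0 / K] * K
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(K)) for j in range(K)]
    return pi

def marginal_density(x, pi, means):
    """Mixture density sum_k pi_k f_k(x) with f_k = N(means[k], 1) (assumed)."""
    return sum(w * math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2.0 * math.pi)
               for w, m in zip(pi, means))

P = [[0.9, 0.1], [0.2, 0.8]]
pi = stationary(P)
print(pi, marginal_density(0.0, pi, [-1.0, 1.0]))
```

For this transition matrix the stationary weights are $(2/3, 1/3)$, and these are the mixing weights that MLI would pair with the emission densities.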

Let us make another point. Let $\theta$ be fixed and let $\Delta_l$ and $\Delta_l^*$ be the corrections obtained
with and without the independence assumption ($p_{ij} = \pi_j$, $i, j \in S$), respectively. The
following intuitive fact has been shown in [24] by simulation: $\Delta_l^* < \Delta_l$ and the difference
$\Delta_l - \Delta_l^*$ widens as the dependence in $P$ becomes stronger. This suggests that there is
more to gain by adjusting VT for mixtures toward MPL-MLI than by adjusting VT for 
the actual HMM toward the true MLE. Thus, if one is interested in a computationally 
efficient approximation to (the Baum implementation of) MPL-MLI, the adjusted Viterbi 
training for mixtures is a sensible alternative to the baseline Viterbi training for mixtures. 
Also, note that VA for mixture models was studied in [27], where, in addition to the 
theoretical demonstration of the VT bias, it was also shown by simulations that this bias 
could be significantly reduced by VA. Importantly, in the mixture case, the VA corrections 
are often given explicitly, which simplifies the implementation of the algorithm. 
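
A small simulation can illustrate the bias and its correction in the mixture case. The sketch below is not a reproduction of the experiments in [27]; it assumes the symmetric mixture $\tfrac12 N(-a, 1) + \tfrac12 N(a, 1)$ with known weights and variances, starts one training step at the true means, and uses the closed-form limit $a(2\Phi(a) - 1) + 2\varphi(a)$ of the class-1 VT mean, which is valid only for this symmetric setting. All function names are hypothetical.

```python
import math
import random

def sample_mixture(n, a, seed=0):
    """Draw n i.i.d. points from 0.5*N(-a,1) + 0.5*N(a,1)."""
    rng = random.Random(seed)
    return [rng.gauss(a if rng.random() < 0.5 else -a, 1.0) for _ in range(n)]

def vt_means(xs, means):
    """One VT step for the two means: classify each point, then average per class."""
    sums, sizes = [0.0, 0.0], [0, 0]
    for x in xs:
        # Equal weights and variances: pick the nearer mean; ties go to class 0.
        l = 0 if (x - means[0]) ** 2 <= (x - means[1]) ** 2 else 1
        sums[l] += x
        sizes[l] += 1
    return [sums[l] / sizes[l] if sizes[l] > 0 else means[l] for l in range(2)]

def limit_mean(a):
    """Limit of the class-1 VT mean at the true parameters (symmetric case only)."""
    Phi = 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))
    phi = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    return a * (2.0 * Phi - 1.0) + 2.0 * phi

a = 1.0
true_means = [-a, a]
xs = sample_mixture(20000, a)
theta_vt = vt_means(xs, true_means)                     # baseline VT step
mu = [-limit_mean(a), limit_mean(a)]
delta = [t - m for t, m in zip(true_means, mu)]         # correction at the true parameters
theta_va = [est + d for est, d in zip(theta_vt, delta)] # adjusted step
print(theta_vt, theta_va)
```

Started at the truth, the baseline step drifts to means near $\pm 1.17$, whereas the adjusted step stays close to $\pm 1$, which is the fixed point behavior that VA is designed to restore.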

The independent training approach is also a natural choice when the underlying regime 
is a general ergodic process (not necessarily Markov) with an (unknown) stationary
distribution $\pi$. Even when not of direct interest, $\pi$ can and needs to be estimated. Again,
if computational efficiency is an issue, VA for mixtures with unknown weights is an
alternative to the Baum algorithm (for mixtures with unknown weights). Note that in
this case, the corrections $R_l = \pi_l - q_l(\psi)$, with $q_l(\psi)$ as in (2.14), should be used in
addition to the $\Delta_l$ corrections. Simulations in [27] showed a clear advantage of using
both adjustments $R_l$ and $\Delta_l$ for mixture models with unknown $\pi$. In particular, VA was,
as usual, both superior to VT and only slightly inferior to EM, in precision. Remarkably, 
taking few steps to stabilize, VA also outperformed VT in total runtime. 

2.4. Implementation 

To implement VA in practice, explicit expressions for $Q_l(\psi)$ (or $\mu_l(\psi)$) and $q_{ij}(\psi)$ are
desirable. In general, however, these functions can be very difficult to compute with high
precision. At the same time, as was just pointed out in Section 2.3 above, the corrections
$\Delta_l$ and $R_l$ are easy to obtain for a broad class of mixture models including the most
commonly used mixtures of Gaussians with equal and known covariances. Other details
of VA implementation have been addressed in [27] and [24] for mixture and more general
models, respectively. For example, [24] discusses the stochastically adjusted Viterbi
training, an efficient implementation of VA for general HMMs when the corrections cannot
be obtained analytically. Although simulations do require extra computations, the
overall complexity of the stochastically adjusted VT can still be considerably lower than
that of Baum-Welch. Certainly, this requires further investigation. Other practical issues
are also a subject of continuing investigation.

3. Infinite Viterbi alignment 

The idea of the adjusted Viterbi training is based on, firstly, the observation that the 
maximum likelihood path (the Viterbi alignment) differs substantially from the underly- 
ing Markov chain and, secondly, that these differences need to be accounted for in order 
for the overall HMM-based inference to be accurate. Our adjusted Viterbi training need 
not be the only method to correct the training process for these differences. However, any 
such method must inevitably appreciate the asymptotic properties of both the Viterbi 
alignment and the subsamples of the emissions as classified by the alignment. After all, 
it is these features that determine the properties of the VT estimators in general and the 
asymptotic bias of VT in particular. 

Even disregarding the non-uniqueness of the Viterbi alignment $v(x_{1:n})$ (dependence on
$\psi$ is temporarily suppressed), the asymptotic behavior of $v(X_{1:n})$ is not trivial since the
$(n+1)$th observation can in principle change the entire alignment based on $x_{1:n}$. Namely,
let $v(x_{1:n})$ and $v(x_{1:n+1})$ be the alignments based on $x_{1:n}$ and $x_{1:n+1}$, respectively. It
might happen with positive probability that $v(x_{1:n})_i \ne v(x_{1:n+1})_i$ for every $i = 1, \ldots, n$.
At the same time, the fact that the alignment changes infinitely often makes it difficult
to define a meaningful infinite alignment process. For most HMMs, however, there is
a positive probability of observing $x_{1:n}$ such that, regardless of the value of the
$(n+1)$th observation (provided $n$ is sufficiently large), the alignments $v(x_{1:n})$ and $v(x_{1:n+1})$
agree for a sufficiently long time $u < n$. Consequently, regardless of what happens in the
future, the first $u$ elements of the alignment remain constant. Provided that there is an
increasing unbounded sequence $u_i$ ($u < u_1 < u_2 < \cdots$) such that the alignment up to $u_i$
remains constant, infinite alignments can then be defined. The observation that for most
commonly used HMMs, a typical realization $x_{1:\infty}$ has infinitely many $u_i$ is the basis of
our further analysis.

Consider the following simple model that guarantees almost every $x_{1:\infty}$ to have
infinitely many $u_i$'s and provides an insight into a significantly more general scenario. Let
state $1 \in S$ and event $A \in \mathcal{B}$ be such that $P_1(A) > 0$, while $P_l(A) = 0$ for $l = 2, \ldots, K$.
Thus, any observation $x_u \in A$ is almost surely generated under $Y_u = 1$ and we say that
$x_u$ indicates its state. Consider $n$ to be the terminal time and note that any positive
likelihood path, including $v(x_{1:n})$, the maximum likelihood one, must go through the state
1 at time $u$. This allows us to split the Viterbi alignment into $v^1$ and $v^2$, an alignment
from time 1 through time $u$ and an alignment from time $u$ through time $n$, respectively.
Namely, $v^1$ and $v^2$ maximize $\Lambda(y_{1:u}; x_{1:u})$ and $\Lambda(y_{u:n}; x_{u:n})$, the respective likelihoods.
By concatenating $v^1$ with $v^2_{2:n-u+1}$ (removing the overlapping $v^2_1 = 1$), we obtain $v(x_{1:n})$
that maximizes $\Lambda(y_{1:n}; x_{1:n})$. Clearly, any additional observations $x_{n+1:m}$ do not change
the fact that $x_u$ indicates its state. Hence, for any extension of $x_{1:n}$, the first part of
the alignment is always $v^1$. Thus, any observation that indicates its state also fixes the
beginning of the alignment. Since our HMM is a stationary process that has a positive
probability of generating state-indicating observations, there will be infinitely many
such observations almost surely. (The overlap $v^2_1 = 1$ is surely a nuisance since $v^2_{2:n-u+1}$
maximizes $\Lambda(y_{u+1:n}; x_{u+1:n})$ with the initial distribution $\pi$ replaced by $(p_{1j})_{j \in S}$.)



3.1. Nodes 



The above example is rather exceptional and we next define nodes to generalize the idea
of state-indicating observations.
First, consider the scores

$$\delta_u(l) = \max_{y_{1:u-1} \in S^{u-1}} \Lambda((y_{1:u-1}, l); x_{1:u}), \tag{3.1}$$

defined for all $u \ge 1$, $x_{1:u} \in \mathcal{X}^u$ and states $l$ in $S$. Thus, $\delta_u(l)$ is the maximum of the
likelihood of the paths terminating at $u$ in state $l$. Note that $\delta_1(l) = \pi_l f_l(x_1)$. The recursion

$$\delta_{u+1}(j) = \max_{l \in S}(\delta_u(l) p_{lj}) f_j(x_{u+1}) \quad \text{for all } u \ge 1 \text{ and } j \in S \tag{3.2}$$

helps to verify that $V(x_{1:n})$, the set of all the Viterbi alignments, can be written as
follows:

$$V(x_{1:n}) = \{v \in S^n : \forall i \in S,\ \delta_n(v_n) \ge \delta_n(i) \text{ and } \forall u : 1 \le u < n,\ v_u \in t(u, v_{u+1})\}, \tag{3.3}$$

where $t(u,j) = \{l \in S : \forall i \in S\ \delta_u(l) p_{lj} \ge \delta_u(i) p_{ij}\}$ for every $u = 1, \ldots, n$.

Thus, using (3.2), the Viterbi algorithm in its forward pass calculates $\delta_u(i)$, $i = 1, \ldots, K$,
$u = 1, \ldots, n$, and stores maximizers $l \in t(u,j)$ (with some tie-breaking rule) to
yield $\delta_{u+1}(j) = \delta_u(l) p_{lj} f_j(x_{u+1})$. The final alignment can then be found by backtracking
as follows: $v_n \in \arg\max_{i \in S} \delta_n(i)$, $v_u \in t(u, v_{u+1})$, $u = n-1, \ldots, 1$.
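The forward pass and the backtracking just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the cited simulations: the function name `viterbi`, the 0-based indexing of times and states, and the input format (a table `F` with `F[u][j]` standing for $f_j(x_{u+1})$) are assumptions made here. Ties are broken in favor of the smallest state index, which is one possible tie-breaking rule among many.

```python
def viterbi(pi, P, F):
    """Return (path, delta) for initial distribution pi, transition matrix P and
    emission values F[u][j] = f_j(x_{u+1}); times and states are 0-based."""
    n, K = len(F), len(pi)
    delta = [[pi[l] * F[0][l] for l in range(K)]]       # delta_1(l) = pi_l f_l(x_1)
    back = [None]                                       # back[u][j]: stored element of t(u, j)
    for u in range(1, n):
        prev, row, ptr = delta[-1], [], []
        for j in range(K):
            # max() returns the first maximizer, so ties go to the smallest state
            l = max(range(K), key=lambda i: prev[i] * P[i][j])
            ptr.append(l)
            row.append(prev[l] * P[l][j] * F[u][j])     # recursion (3.2)
        delta.append(row)
        back.append(ptr)
    path = [max(range(K), key=lambda i: delta[-1][i])]  # v_n in argmax_i delta_n(i)
    for u in range(n - 1, 0, -1):                       # backtracking through the stored pointers
        path.append(back[u][path[-1]])
    path.reverse()
    return path, delta
```

The sketch multiplies raw probabilities, so it is suitable only for short sequences; for long sequences one would typically rescale the scores or work with logarithms, neither of which affects the maximizers.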

Definition 3.1. Given $x_{1:u}$, the first $u$ observations, the observation $x_u$ is said to be
an $l$-node (of order 0) if

$$\delta_u(l) p_{lj} \ge \delta_u(i) p_{ij} \quad \text{for all } i,j \in S. \tag{3.4}$$

We also say that $x_u$ is a node (of order 0) if it is an $l$-node for some $l \in S$. We say that
$x_u$ is a strong node if the inequalities in (3.4) are strict for every $i,j \in S$, $i \ne l$. Definition
3.2 below generalizes this one by including nodes of positive orders.
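Condition (3.4) is a finite set of inequalities and can be checked directly once the scores $\delta_u(\cdot)$ are available. The following sketch is illustrative only; the function name `node_states`, the 0-based state labels and the `strong` flag are assumptions introduced here.

```python
def node_states(delta_u, P, strong=False):
    """States l for which x_u is an l-node of order 0, i.e. (3.4) holds:
    delta_u(l) p_lj >= delta_u(i) p_ij for all i, j.
    With strong=True the inequalities must be strict for every i != l."""
    K = len(delta_u)
    found = []
    for l in range(K):
        ok = True
        for j in range(K):
            for i in range(K):
                if i == l:
                    continue
                lhs, rhs = delta_u[l] * P[l][j], delta_u[i] * P[i][j]
                if (lhs <= rhs) if strong else (lhs < rhs):
                    ok = False
        if ok:
            found.append(l)
    return found
```

For instance, with scores concentrated on the first state (as for a state-indicating observation) and a transition matrix whose first row contains a zero, the first state is returned for `strong=False` but not for `strong=True`.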

Clearly, if $x_u$ is an $l$-node, then $l \in t(u,j)$ for all $j \in S$ (see Figure 1). Consequently,
if $x_{1:u}$ is such that $x_u$ is an $l$-node, then there exists $v(x_{1:n}) \in V(x_{1:n})$ with $v(x_{1:n})_u = l$,
which guarantees (the existence of) a fixed alignment up until $u$. If the node is strong,
then all the Viterbi alignments must coalesce at $u$. Thus, the concept of strong nodes
circumvents the inconveniences caused by the non-uniqueness. Namely, regardless of how
the ties are broken, every alignment is forced into $l$ at $u$ and any tie-breaking rule would
suffice for the purpose of obtaining the fixed alignments. However tempting, strong nodes,
unlike the general ones, are quite restrictive. Indeed, suppose our model allows for $A$
with $P_1(A) > 0$ and $P_l(A) = 0$, for $l = 2, \ldots, K$. Hence, for almost every $x_u \in A$, we have
$\delta_u(1) > 0$ and $\delta_u(i) = 0$ for every $i \in S$, $i \ne 1$. Thus, (3.4) holds and $x_u$ is a 1-node. If, in
addition, $p_{1j} > 0$ for every $j \in S$, then for every $i,j \in S$, $i \ne 1$, the left-hand side of (3.4)
is positive, whereas the right-hand side is 0, making $x_u$ a strong node. If, however, there
is a $j$ such that $p_{1j} = 0$, which can easily happen if $K > 2$, then for this $j$, both sides are
0 and $x_u$ is no longer strong.

Figure 1. An example of the Viterbi algorithm in action. The solid line corresponds to the
final alignment $v(x_{1:n})$. The dashed links are of the form $(k,l)$-$(k+1,j)$ with $l \in t(k,j)$ and
are not part of the final alignment. For example, (1,3)-(2,2) is because $3 \in t(1,2)$, $2 \in t(2,3)$.
The observation $x_u$ is a 2-node since we have $2 \in t(u,j)$ for every $j \in S$. Also, note that $v(x_{1:u})$
is fixed, that is, $v(x_{1:u}) = v(x_{1:n})_{1:u}$.

The concept of nodes (including higher order nodes to be defined below) is essentially
the same as 'crossing Viterbi paths' of [7] or 'meeting times/states' [5], where the existence
of strong nodes is proved implicitly. The above works assume that the entries of $P$, the
transition matrix, are positive, which excludes our previous example of $x_u$ being a node
and not a strong node. Using the concept of nodes, let us briefly analyze the results
of these works. In [7], there are two main theorems. In terms of nodes, Theorem 1 of
[7] states the following. Let $j_0 \in S$ be a recurrent state. Let $i_0 \in S$ be such that for all
$i,j,k \in S$, $i \ne i_0$,

$$P_{j_0}(\{x \in \mathcal{X} : p_{j i_0} f_{i_0}(x) p_{i_0 k} > p_{ji} f_i(x) p_{ik}\}) > 0. \tag{3.5}$$

Then, almost every realization of HMM has infinitely many nodes. Up to notation, the
condition (3.5) above is stated as it appears in [7]. However, this theorem is proved in
[7] under the following stronger condition (3.6) (in [6], the authors of [7] have recently
confirmed this to be a misprint):

$$P_{j_0}(\{x \in \mathcal{X} : p_{j i_0} f_{i_0}(x) p_{i_0 k} > p_{ji} f_i(x) p_{ik}\ \ \forall i,j,k \in S,\ i \ne i_0\}) > 0. \tag{3.6}$$

To see how significantly this alteration weakens the theorem, let $A \subset \mathcal{X}$ be the set as in
(3.6) and let us first show that any $x_u \in A$ is a strong $i_0$-node. Indeed, fix $i \in S$, $i \ne i_0$.
There then exists $j$ (depending on $i$) such that $\delta_u(i) = \delta_{u-1}(j) p_{ji} f_i(x_u)$. Next, for every
$k$, $\delta_u(i) p_{ik} = \delta_{u-1}(j) p_{ji} f_i(x_u) p_{ik}$ and thus

$$\delta_u(i) p_{ik} < \delta_{u-1}(j) p_{j i_0} f_{i_0}(x_u) p_{i_0 k} \le \max_{j \in S} \delta_{u-1}(j) p_{j i_0} f_{i_0}(x_u) p_{i_0 k} = \delta_u(i_0) p_{i_0 k}.$$

Thus, (3.6) implies that every observation from $A$ is a strong node. Since $j_0$ is recurrent
and $A$ has a positive $P_{j_0}$-probability, clearly there are almost surely infinitely many such
nodes. The existence of $A$ satisfying (3.6), however, appears to be more of an exception
than a rule. Note that (3.6) does not hold if $P$ contains a 0 in every row or in every column.
Another important example of HMMs for which $A$ satisfying (3.6) need not exist is the
HMM with additive white Gaussian noise (Example 1 of [5, 7]). In fact, it is stated in
[7] that the assumption of their Theorem 1 is satisfied for this model independently of
the transition matrix. In [6], the authors of [5, 7] have recently confirmed accidental
omissions of the intended positivity condition, which, from the example below, can be
seen to be crucial for Theorem 1 of [7], as well as Theorems 3 and 6 of [5]. Also, note
that the following example does not require that $P$ contain zeros in every row or column
and is hence substantially different from the example given above. Thus, let $K = 3$ and
let $p_{13} = 0$ be the only zero entry of $P$. This already rules out (3.6) for $i_0 = 1$ and $i_0 = 3$.
Following [7], in the additive white Gaussian noise model, the emission density $f_i$ is
univariate normal with mean $i = 1,2,3$ and variance 1. Let $x$ be such that

$$p_{j2} f_2(x) p_{2k} > p_{ji} f_i(x) p_{ik} \quad \forall i,j,k \in S,\ i \ne 2.$$

In particular, with $j = 2$, $p_{22} f_2(x) p_{23} > p_{23} f_3(x) p_{33}$ and $p_{22} f_2(x) p_{21} > p_{21} f_1(x) p_{11}$.
Hence,

$$\frac{f_2(x)}{f_3(x)} > \frac{p_{33}}{p_{22}}, \qquad \frac{f_2(x)}{f_1(x)} > \frac{p_{11}}{p_{22}}. \tag{3.7}$$

Since $p_{11}$ and $p_{33}$ are both positive, one can easily find $p_{22} > 0$ sufficiently small for (3.7)
to fail, implying that $i_0 \ne 2$. Therefore, (3.6), the (corrected) hypothesis of Theorem 1
of [7], which is also the hypothesis of Theorem 3 of [5], need not hold for the HMM with
the additive Gaussian noise and $P$ general.
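The failure of (3.7) for small $p_{22}$ can be made concrete. For unit-variance normal densities with means 1, 2, 3, a direct computation (ours, not taken from the cited works) gives $f_2(x)/f_1(x) = e^{x - 3/2}$ and $f_2(x)/f_3(x) = e^{5/2 - x}$, so the two inequalities in (3.7) hold simultaneously exactly on an open interval of $x$ that is empty whenever $p_{11} p_{33} \ge e\, p_{22}^2$. The sketch below computes that interval; the function name `interval_37` and the numerical values in the usage note are illustrative assumptions.

```python
import math

def interval_37(p11, p22, p33):
    """Open interval of x on which both inequalities of (3.7) hold for normal
    emission densities with means 1, 2, 3 and variance 1; None if it is empty."""
    lo = 1.5 + math.log(p11 / p22)   # f_2(x)/f_1(x) = exp(x - 1.5) > p11/p22
    hi = 2.5 - math.log(p33 / p22)   # f_2(x)/f_3(x) = exp(2.5 - x) > p33/p22
    return (lo, hi) if lo < hi else None
```

With the hypothetical diagonal entries $p_{11} = p_{33} = 0.5$ and $p_{22} = 0.1$, the function returns `None`, so no $x$ satisfies (3.7); with $p_{11} = p_{33} = 0.3$ and $p_{22} = 0.6$, a non-empty interval containing $x = 2$ is returned. Note that (3.7) is only a necessary consequence of the displayed condition, so a non-empty interval does not by itself establish (3.6).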

We next extend the notion of nodes (Definition 3.1) to account for the fact that a
general ergodic $P$ can have a zero in every row, in which case nodes of order 0 need
not exist. Indeed, suppose that $x_{1:u}$ is such that $\delta_u(i) > 0$ for every $i \in S$. In this case,
(3.4) implies that $p_{lj} > 0$ for every $j \in S$ (the $l$th row of $P$ must be positive) and (3.4) is
equivalent to $\delta_u(l) \ge \max_i (\max_j (p_{ij}/p_{lj}) \delta_u(i))$.

First, we introduce $p_{ij}^{(r)}(u)$, the maximum likelihood of the paths connecting states $i$
and $j$ at times $u$ and $u + r$, respectively. Thus, for each $u \ge 1$ and $r \ge 1$, let

$$p_{ij}^{(r)}(u) = \max_{q_{1:r} \in S^r} p_{i q_1} f_{q_1}(x_{u+1}) p_{q_1 q_2} f_{q_2}(x_{u+2}) p_{q_2 q_3} \cdots p_{q_{r-1} q_r} f_{q_r}(x_{u+r}) p_{q_r j}.$$

Also, note that $p_{ij}^{(r)}(u) = \max_{q \in S} p_{iq}^{(r-1)}(u) f_q(x_{u+r}) p_{qj}$, where $p_{ij}^{(0)}(u) := p_{ij}$, $u \ge 1$.
Recursion (3.2) then generalizes as follows: for all $u > r \ge 1$, for each $j \in S$,

$$\delta_{u+1}(j) = \max_{i \in S}(\delta_{u-r}(i) p_{ij}^{(r)}(u-r)) f_j(x_{u+1}).$$

Figure 2. $x_u$ is a 2nd-order 2-node, $x_{u-1}$ is a 3rd-order 3-node. Any alignment $v(x_{1:n})$ has
$v(x_{1:n})_u = 2$.

Definition 3.2. Let $1 \le r < n$, $1 \le u \le n - r$ and let $l \in S$. Given $x_{1:u+r}$, the first $u + r$
observations, $x_u$ is said to be an $l$-node of order $r$ if

$$\delta_u(l) p_{lj}^{(r)}(u) \ge \delta_u(i) p_{ij}^{(r)}(u) \quad \text{for all } i,j \in S. \tag{3.8}$$

$x_u$ is said to be an $r$th-order node if it is an $r$th-order $l$-node for some $l \in S$. $x_u$ is said to
be a strong node of order $r$ if the inequalities in (3.8) are strict for every $i,j \in S$, $i \ne l$.

Note that any $r$th-order node is also a node of order $r'$ for any integer $r \le r' < n$
and thus, by the order of a node, we will mean the minimal such $r$. Also, note that for
$K = 2$, a node of any order is a node of order 0. Hence, positive order nodes only emerge
for $K \ge 3$. If $x_u$ is an $l$-node of order $r$, then regardless of what the observations after
$x_{u+r}$ are, $x_u$ remains an $l$-node of order $r$. Moreover, it follows from a decomposition of
$V(x_{1:n})$ similar to that of (3.3) that there exists $v(x_{1:n}) \in V(x_{1:n})$ such that $v(x_{1:n})_u = l$.
The difference between nodes (of order 0) and nodes of positive order $r$ is that for
$v(x_{1:n})_u = l$ to hold, $u$ needs to be at least $r$ steps before $n$ ($n \ge u + r$). Otherwise, for $m$
such that $u \le m < u + r$, it might happen that no alignment $v(x_{1:m}) \in V(x_{1:m})$ satisfies
$v(x_{1:m})_u = l$. The role of higher order nodes is similar to that of nodes. Namely, provided
a proper tie-breaking rule is given, the existence of a higher order node $x_u$ ensures the
existence of a fixed alignment up to $u$. At the same time, allowing nodes of higher orders
removes the positivity restriction on rows of $P$.
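The recursion for $p_{ij}^{(r)}(u)$ and the test (3.8) translate directly into code. The sketch below is illustrative: the function names `p_r` and `node_states_of_order`, the 0-based state labels, and the input `F_next` (a list whose $s$th row holds the emission values $f_q(x_{u+s+1})$ for the $r$ observations following $x_u$) are assumptions made here. An empty `F_next` reduces the test to (3.4).

```python
def p_r(P, F_next):
    """p^{(r)}_{ij}(u) computed by the recursion in the text, starting from p^{(0)} = P.
    F_next[s][q] = f_q(x_{u+s+1}) for s = 0, ..., r-1, so r = len(F_next)."""
    K = len(P)
    cur = [row[:] for row in P]
    for f in F_next:
        cur = [[max(cur[i][q] * f[q] * P[q][j] for q in range(K))
                for j in range(K)] for i in range(K)]
    return cur

def node_states_of_order(delta_u, P, F_next):
    """States l such that x_u is an l-node of order r = len(F_next), cf. (3.8)."""
    K = len(P)
    pr = p_r(P, F_next)
    return [l for l in range(K)
            if all(delta_u[l] * pr[l][j] >= delta_u[i] * pr[i][j]
                   for i in range(K) for j in range(K))]
```

As a hypothetical example, take $K = 3$, a transition matrix with zero diagonal and all off-diagonal entries equal to 0.5, and positive scores dominated by the first state. No state then satisfies (3.4), because every row of the matrix contains a zero, yet one further observation with equal emission values makes the first state a node of order 1.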

Although implicit (and defined relative to a fixed and global tie-breaking rule), nodes
of orders possibly higher than 0 are also a main tool in [5, 7]. Specifically, statements K'
and K'', underpinning the main results of [7], are interpreted in terms of nodes as follows.
K': almost every realization of an HMM has infinitely many (variable order) nodes. (The
node orders $r_1, r_2, \ldots$ in K' can depend on the realization $x_{1:\infty}$ and hence need not be
almost surely bounded.) K'': almost every realization of an HMM has infinitely many
nodes of order 0. (Thus, K' implies K'' and for $K = 2$, K' is equivalent to K''.) Lemmas
3.1 and 3.2 below give significantly stronger results, which also allow for an algorithmic
construction of infinite piecewise alignments.

3.2. Piecewise alignment 

Let $x_{1:n}$ be such that $x_{u_i}$ is an $l_i$-node of order $r$, $1 \le i \le k$, for some $k < n$ and
assume that $u_k + r < n$ and $u_{i+1} > u_i + r$ for all $i = 1, 2, \ldots, k-1$. Such nodes are said
to be separated. It follows from the definition of nodes that there exists a Viterbi
alignment $v_{1:n} \in V(x_{1:n})$ such that $v_{u_i} = l_i$ for every $i = 1, \ldots, k$. Indeed, Definition 3.2
immediately implies the existence of a Viterbi alignment $v'_{1:n} \in V(x_{1:n})$ with $v'_{u_k} = l_k$. The
same definition and optimality of backtracking by the Viterbi algorithm imply that
$(w_{1:u_{k-1}+r}, v'_{u_{k-1}+r+1:n}) \in V(x_{1:n})$ for some prefix $w_{1:u_{k-1}+r}$ with $w_{u_{k-1}} = l_{k-1}$.
Continuing in this manner down to node $x_{u_1}$, we exhibit $v_{1:n}$ with $v_{u_i} = l_i$, $i = 1, 2, \ldots, k$.

Let us discuss the assumption $u_{i+1} > u_i + r$, $i = 1, 2, \ldots, k-1$. The fact that $x_{u_i}$ is an
$r$th-order $l_i$-node guarantees that when backtracking from $u_i + r$ down to $u_i$, ties can be
broken in such a way that, regardless of the values of $x_{u_i+r+1:n}$ and how ties are broken
in between $n$ and $u_i + r$, the alignment goes through $l_i$ at $u_i$. At the same time, segment
$u_i, \ldots, u_i + r$ is 'delicate', that is, unless $x_{u_i}$ is a strong node, breaking ties arbitrarily
on $u_i, \ldots, u_i + r$ can result in $v(x_{1:n})_{u_i} \ne l_i$. Hence, when neither $x_{u_i}$ nor $x_{u_{i+1}}$ is strong
and $u_{i+1} \le u_i + r$, breaking ties in favor of $x_{u_i}$ can result in $v_{u_{i+1}} \ne l_{i+1}$. Note that such
a pathological situation is impossible if $r = 0$ and might be rare in practice for $r > 0$.
Finally, note that this assumption is not restrictive since it is always possible to choose
from any sequence of nodes a subsequence of nodes that are separated.

To formalize the piecewise construction introduced above, let

$$W^l(x_{1:n}) = \{v \in S^n : v_n = l,\ \Lambda(v; x_{1:n}) \ge \Lambda(w; x_{1:n})\ \forall w \in S^n : w_n = l\},$$
$$V^l(x_{1:n}) = \{v \in V(x_{1:n}) : v_n = l\}, \quad \text{for all } n \ge 1,\ l \in S \text{ and } x_{1:n} \in \mathcal{X}^n,$$

be the sets of maximizers of the constrained likelihood and the subset of maximizers
of the (unconstrained) likelihood, respectively, all elements of which go through $l$ at $n$.
Note that, unlike $W^l(x_{1:n})$, $V^l(x_{1:n})$ might be empty. It can be shown that $V^l(x_{1:n}) \ne \emptyset$
implies that $V^l(x_{1:n}) = W^l(x_{1:n})$. Also, let the subscript $(l)$ stand for using $(p_{li})_{i \in S}$ as
the initial distribution in place of $\pi$. Thus, the sets $V^m_{(l)}(x_{1:n})$ and $W^m_{(l)}(x_{1:n})$, $m \in S$, will
also be used.

The piecewise construction can be formulated as follows. Suppose that there exist
$l_1, \ldots, l_k \in S$ and $u_1, \ldots, u_k \ge 1$, $r_1, \ldots, r_k \ge 0$ with $u_1 + r_1 < u_2 + r_2 < \cdots < u_k + r_k < n$
such that $x_{u_i}$ is an $l_i$-node of order $r_i$ for every $i \le k$. There then exists an alignment
$v(x_{1:n}) = (v^1, \ldots, v^{k+1}) \in V(x_{1:n})$, where $v^1 \in W^{l_1}(x_{1:u_1})$,

$$v^i \in W^{l_i}_{(l_{i-1})}(x_{u_{i-1}+1:u_i}),\ 2 \le i \le k, \quad \text{and} \quad v^{k+1} \in V_{(l_k)}(x_{u_k+1:n}). \tag{3.9}$$

Moreover, for every $i = 1, 2, \ldots, k$, $w(i) := (v^1, \ldots, v^i) \in V^{l_i}(x_{1:u_i})$. Thus, when a node is
observed at time $u_k$, the alignment up to $u_k$ becomes fixed, yielding natural extensions of
finite alignments for $n \to \infty$. Besides providing the tool for the asymptotic analysis, the
piecewise construction is also of computational significance. Indeed, note that once $x_{u_1}$
has been recognized to be a node and $w(1)$ has been constructed, the memory allocated
for storing $x_{1:u_1}$ and $t(u,j)$ (see (3.3)) for $u < u_1$ and $j \in S$ is no longer needed and can
be freed.
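The computational remark can be illustrated by a streaming decoder that emits a piece of the alignment each time a node of order 0 is detected and then discards the backpointers of that piece. The sketch below is a simplified illustration under stated assumptions, not the construction analyzed in this paper: it handles nodes of order 0 only, breaks ties in favor of the smallest state index, rescales the scores at every step (which leaves all comparisons intact), and assumes every observation has positive likelihood. The names `find_node`, `backtrack` and `stream_viterbi` are hypothetical.

```python
def find_node(delta, P):
    """First state l satisfying (3.4) for the current scores, or None."""
    K = len(delta)
    for l in range(K):
        if all(delta[l] * P[l][j] >= delta[i] * P[i][j]
               for i in range(K) for j in range(K)):
            return l
    return None

def backtrack(back, end_state):
    """Follow the stored pointers of the current piece backwards from end_state."""
    piece = [end_state]
    for ptr in reversed(back):
        piece.append(ptr[piece[-1]])
    piece.reverse()
    return piece

def stream_viterbi(pi, P, emissions):
    """Yield successive pieces of a Viterbi alignment; emissions is an iterable of
    rows [f_1(x_u), ..., f_K(x_u)]. Only pointers since the last node are stored."""
    K = len(pi)
    delta, back, last, open_len = None, [], None, 0
    for f in emissions:
        if delta is None:                       # very first observation
            delta = [pi[j] * f[j] for j in range(K)]
        elif open_len == 0:                     # previous observation was a node in state `last`
            delta = [delta[last] * P[last][j] * f[j] for j in range(K)]
        else:                                   # ordinary step of recursion (3.2)
            ptr = [max(range(K), key=lambda i: delta[i] * P[i][j]) for j in range(K)]
            delta = [delta[ptr[j]] * P[ptr[j]][j] * f[j] for j in range(K)]
            back.append(ptr)
        open_len += 1
        top = max(delta)
        if top > 0:
            delta = [d / top for d in delta]    # rescaling does not change any comparison
        l = find_node(delta, P)
        if l is not None:                       # node: the piece ending here is fixed
            yield backtrack(back, l)
            back, last, open_len = [], l, 0     # memory of the finished piece is released
    if open_len:                                # final piece, ending in a maximizer of delta_n
        yield backtrack(back, max(range(K), key=lambda i: delta[i]))
```

Concatenating the yielded pieces gives an alignment of the same form as (3.9) with all $r_i = 0$: each intermediate piece ends in the node state $l_i$ and the next piece is computed with $(p_{l_i j})_{j \in S}$ playing the role of the initial distribution.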

Thus, if $x_{1:\infty}$ has infinitely many nodes $\{x_{u_k}\}_{k \ge 1}$ that are separated, then $v(x_{1:\infty})$, an
infinite piecewise alignment based on the node times $\{u_k(x_{1:\infty})\}_{k \ge 1}$, can be defined as
follows. If the sets $W^{l_i}_{(l_{i-1})}(x_{u_{i-1}+1:u_i})$, $i \ge 2$, as well as $W^{l_1}(x_{1:u_1})$ are singletons, then (3.9)
immediately defines a unique infinite alignment $v(x_{1:\infty}) = (v^1(x_{1:u_1}), v^2(x_{u_1+1:u_2}), \ldots)$.
Otherwise, ties must be broken. In order for our infinite alignment process to be
regenerative, a natural consistency condition must be imposed on rules to select unique
$v(x_{1:n})$ from

$$W^{l_1}(x_{1:u_1}) \times W^{l_2}_{(l_1)}(x_{u_1+1:u_2}) \times \cdots \times W^{l_k}_{(l_{k-1})}(x_{u_{k-1}+1:u_k}) \times V_{(l_k)}(x_{u_k+1:n}).$$

Resulting infinite alignments, as well as decoding $v : \mathcal{X}^\infty \to S^\infty$ based on such
alignments, will be called proper. This condition is, perhaps, best understood by the
following example. Suppose, for some $x_{1:5} \in \mathcal{X}^5$, that $W^1_{(l)}(x_{1:5}) = \{12211, 11211\}$ and
suppose that the tie is broken in favor of 11211. Now, whenever $W^1_{(l)}(x'_{1:4})$ contains
$\{1221, 1121\}$, we naturally require that 1221 not be selected. In particular, we break the
tie in $W^1_{(l)}(x_{1:4}) = \{1221, 1121\}$ by selecting 1121. Subsequently, 112 is selected from
$W^2_{(l)}(x_{1:3}) = \{122, 112\}$, and so on. It can be shown that a decoding by piecewise
alignment (3.9) with ties broken in favor of min (or max) under the reverse lexicographic
ordering of $S^n$, $n \in \mathbb{N}$, is a proper decoding.

Example 2 (Mixtures revisited). Consider the mixture model as in Example 1.
In this case, an observation $x_u$ is an $l$-node if and only if $\delta_u(l) \ge \delta_u(i)$ for every
$i \in S$. In particular, this implies that every observation is an $l$-node (of order 0) for
some $l \in S$. Recursion (3.2) can then be written for any $u \ge 2$ and $i \in S$ as
$\delta_u(i) = \max_{j \in S} \delta_{u-1}(j) \pi_i f_i(x_u) = c \pi_i f_i(x_u)$, where $c$ does not depend on $i$. Hence, $x_u$ is an
$l$-node if and only if $\pi_l f_l(x_u) \ge \pi_i f_i(x_u)$ for all $i \in S$. Therefore, the alignment can be
obtained component-wise: $v(x_{1:n}) = (v(x_1), \ldots, v(x_n))$, where

$$v(x) = \arg\max_{i \in S} \pi_i f_i(x). \tag{3.10}$$

Clearly, the alignment is proper if the ties in (3.10) are broken consistently, that is, if
$v(x)$ is indeed a well-defined function of $x$.
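A minimal sketch of the component-wise alignment (3.10) follows. The function name `mixture_alignment`, the 0-based state labels and the representation of the densities as Python callables are assumptions made for illustration. Ties are resolved in favor of the smallest state index, so the same $x$ always receives the same label, which is one way of making $v$ a well-defined function of $x$.

```python
def mixture_alignment(weights, densities, xs):
    """Component-wise alignment (3.10): v(x) = argmax_i pi_i f_i(x).
    weights[i] = pi_i, densities[i] = f_i (a callable), xs = observations."""
    def v(x):
        scores = [w * f(x) for w, f in zip(weights, densities)]
        # max() returns the first maximizer, giving a consistent tie-breaking rule
        return max(range(len(scores)), key=scores.__getitem__)
    return [v(x) for x in xs]
```

For two unit-variance normal components with means 0 and 2 and equal weights, the point $x = 1$ is an exact tie and is assigned to the first component every time it occurs.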

Example 2 helps to understand the necessity of breaking ties consistently. If our sole
goal were to construct infinite alignments, then any piecewise (not necessarily proper)
alignment would suffice. However, the existence of $Q_l(\psi)$, $l \in S$, requires more. Indeed,
suppose that the right-hand side of (3.10) is not unique for some $x$, an atom of, say P™,
as defined in (2.3). If the selection in (3.10) is consistent, say, $v(x) = 1$, then, in the limit,
$x$ will also be an atom of $Q_1(\psi)$. Otherwise, if ties in (3.10) are broken arbitrarily, then
the limiting measures might not exist at all.

Also, note that we break ties locally, that is, within individual intervals $u_{i-1} + 1, \ldots, u_i$,
$i \ge 2$, enclosed by the adjacent nodes. This is in contrast to global ordering of $V(x_{1:\infty})$,
such as the one in [5, 7], which ignores decomposition (3.9). A global rule can fail to
produce an infinite alignment going through infinitely many nodes unless the nodes are
strong (as assumed in [5, 7]).

3.3. Barriers 

To test whether $x_u$ is a node of order $r$ requires the entire realization $x_{1:u+r}$ (Definition
3.2). In particular, for an arbitrary prefix $x'_{1:w} \in \mathcal{X}^w$ and $m < u$, the $(w + m + 1)$th element
of $(x'_{1:w}, x_{u-m:u+r})$ need not be a node relative to $(x'_{1:w}, x_{u-m:u+r})$, even when $x_u$ is
a node of order $r$ relative to $x_{1:u+r}$. We show below that typically, a block $x_{1:k} \in \mathcal{X}^k$
($k \ge r$) can be found such that for any $w \ge 1$ and any $x'_{1:w} \in \mathcal{X}^w$, the $(w + k - r)$th
element of $(x'_{1:w}, x_{1:k})$ is a node of order $r$ (relative to $(x'_{1:w}, x_{1:k})$). Sequences $x_{1:k}$ that
ensure the existence of such persistent nodes will be called barriers.

Definition 3.3. Given $l \in S$, $x_{1:k} \in \mathcal{X}^k$ is called a (strong) $l$-barrier of order $r \ge 0$
and length $k \ge 1$ if, for any $w \ge 1$ and every $x'_{1:w} \in \mathcal{X}^w$, $(x'_{1:w}, x_{1:k})$ is such that
$(x'_{1:w}, x_{1:k})_{w+k-r}$ is a (strong) $l$-node of order $r$.

Note that any observation from the set A considered in (3.6) is a barrier of length 1. 
In particular, any observation that indicates a state is a barrier of length 1. 

Next, we state and discuss Lemmas 3.1 and 3.2, the first of the two main results of
this paper. First, let $G_l = \bigcap_{G\ \mathrm{closed},\ P_l(G;\theta_l)=1} G$ denote the support of the family $P_l(\theta_l)$,
$\theta_l \in \Theta_l$, for all $l \in S$.

Definition 3.4. We call a subset $C \subset S$ a cluster if the following conditions are satisfied:

$$\min_{j \in C} P_j\Bigl(\bigcap_{l \in C} \bigl(G_l \cap \{x \in \mathcal{X} : f_l(x) > 0\}\bigr)\Bigr) > 0 \quad \text{and} \quad P_j\Bigl(\bigcap_{l \in C} G_l\Bigr) = 0\ \ \forall j \notin C.$$

Hence, a cluster is a maximal subset of states such that $G_C = \bigcap_{l \in C} G_l$ is 'detectable'.
Distinct clusters need not be disjoint and a cluster can consist of a single state. In this
latter case, such a state is not hidden since it is indicated by any observation which
it emits. When $K = 2$, $S$ is the only cluster possible since otherwise, all observations
would reveal their states and the underlying Markov chain would cease to be hidden. In
practice, many other HMMs have the entirety of $S$ as their (necessarily unique) cluster.

The proof of the following lemma is rather technical and can be found in [26],
Appendix 5.1, pages 26-39.

Lemma 3.1. Assume that for each state $l \in S$,

$$P_l\Bigl(\Bigl\{x \in \mathcal{X} : f_l(x) \max_{j \in S}(p_{jl}) > \max_{i \in S, i \ne l}\Bigl(f_i(x) \max_{j \in S}(p_{ji})\Bigr)\Bigr\}\Bigr) > 0. \tag{3.11}$$

Moreover, assume that there exists a cluster $C \subset S$ and an integer $m < \infty$ such that
the $m$th power of the substochastic matrix $Q = (p_{ij})_{i,j \in C}$ is strictly positive. Then, for
some integers $M$ and $r$, $M > r \ge 0$, there exist $B = B_1 \times \cdots \times B_M \subset \mathcal{X}^M$, $q_{1:M} \in S^M$
and $l \in S$ such that every $x_{1:M} \in B$ is an $l$-barrier of order $r$ (and length $M$), $q_{M-r} = l$,
$P(X_{1:M} \in B \mid Y_{1:M} = q_{1:M}) > 0$ and $P(Y_{1:M} = q_{1:M}) > 0$.

Lemma 3.1 implies that $P(X_{1:M} \in B) > 0$. Also, since every element of $B$ is a barrier
of order $r$, the ergodicity of $X$ therefore guarantees that almost every realization of $X_{1:\infty}$
contains infinitely many $l$-barriers of order $r$. Hence, almost every realization of $X_{1:\infty}$
also has infinitely many $l$-nodes of order $r$.

Let us briefly analyze (3.11) and the existence of a cluster $C$ assumed in Lemma 3.1.
First, consider the case when $S$ itself is a cluster. This occurs, for example, if the supports
of all the emission distributions coincide. Then, the substochastic matrix $(p_{ij})_{i,j \in C} = P$
and aperiodicity of $P$ implies that $P^m$ is strictly positive for some power $m$. Hence, the
cluster assumption is satisfied in this case. Our cluster assumption essentially generalizes
assumption A1 of [5, 7], which requires $P$, the transition matrix, to be strictly positive and
the supports $G_l$ to be all equal. As already pointed out, the assumption of strict positivity
of $P$ becomes rather restrictive when $K > 2$. Moreover, [26], Example 3.11, shows that
the cluster assumption is not only sufficient but also necessary for nodes (and barriers)
to exist. We also point out that the proof of the existence of nodes in [5] (Theorem 2)
heavily relies on the supports being equal, which is also crucial for assumption A2 [5, 7]
and which is not assumed in Lemma 3.1.

Note that (3.11) basically says that for every state $l \in S$, there is a set where the
measure $P_l(\theta_l)$ 'dominates', that is, $\{x \in \mathcal{X} : f_l(x) \max_{j \in S} p_{jl} > \max_{i \in S, i \ne l}(f_i(x) \max_{j \in S} p_{ji})\}$
is of positive $\lambda$-measure. We are not aware of any HMMs used in practice for which
this assumption does not hold. Moreover, for many models (see Example 3 below), it
is actually sufficient for proving the existence of barriers that (3.11) holds for at least
one state $l$, which, provided that the emission distributions $P_l(\theta_l)$, $l \in S$, are all
distinct, is always the case. Also, note that for the mixture model, (3.11) simplifies to
$P_l(\{x : f_l(x)\pi_l > f_i(x)\pi_i,\ \forall i \ne l\}) > 0$ and that assumption (3.11) is weaker than (3.6)
since the latter implies that

$$P_{i_0}\Bigl(\Bigl\{x \in \mathcal{X} : f_{i_0}(x) \max_{j \in S} p_{j i_0} > \max_{i \in S, i \ne i_0}\Bigl(f_i(x) \max_{j \in S} p_{ji}\Bigr)\Bigr\}\Bigr) > 0.$$

Example 3 ($K = 2$). $S = \{1,2\}$ is the only cluster. Assume $P$ to be strictly positive.
Thus, the cluster assumption of Lemma 3.1 is fulfilled. Assume $P_1(\theta_1)$ and $P_2(\theta_2)$ to be
distinct. Following [5], consider the following three cases. Case 1: $p_{11} > p_{21}$ (equivalently,
$p_{22} > p_{12}$); case 2: $p_{11} < p_{21}$ (equivalently, $p_{22} < p_{12}$); case 3: $p_{11} = p_{21}$ (equivalently,
$p_{22} = p_{12}$). Note that since $\lambda(\{x \in \mathcal{X} : f_1(x) \ne f_2(x)\}) > 0$ (the two emission distributions
differ), the sets $\mathcal{X}_1 := \{x \in \mathcal{X} : f_1(x) p_{11} > f_2(x) p_{22}\}$, $\mathcal{X}_2 := \{x \in \mathcal{X} : f_1(x) p_{11} < f_2(x) p_{22}\}$
satisfy

$$\lambda(\mathcal{X}_1) > 0 \quad \text{or} \quad \lambda(\mathcal{X}_2) > 0. \tag{3.12}$$

Without loss of generality, assume $p_{11} \ge p_{22}$, hence $\lambda(\mathcal{X}_1) > 0$. It is then not hard to
exhibit strong 1-barriers in case 1. Indeed, in this case, a Viterbi path $v(x_{1:n})$ can switch
states only at nodes, that is, $v(x_{1:n})_{u:u+1} = (l, j)$, $l \ne j$, implies that $x_u$ is a strong
$l$-node. An integer $k$ can then be chosen sufficiently large for any sequence $z_{1:k} \in \mathcal{X}_1^k$ to be
a strong 1-barrier. Suppose that this were not the case and hence that no $z_i$, $1 \le i \le k$,
would be a 1-node. It could then be shown that no $z_i$ could be a 2-node either, hence
corresponding $k$-segments of Viterbi paths $v(x_{1:n})$, $n > k$, would have to be constant,
namely all 1's or all 2's. However, $k$ is so large that segment $211 \ldots 12$ has a larger
likelihood than $22 \ldots 2$, implying the presence of a strong 1-node.

Thus, in case 1, the occurrence of infinitely many barriers (or nodes) does not require
any additional assumptions. In particular, assumptions A1 (the supports being equal)
and A2 (log-ratio of the densities being square-integrable) of [5, 7] are unnecessary for
proving the results of Theorems 7, 8 and 9 of [5]. Furthermore, assumption (3.11) of
Lemma 3.1 is, in this case, equivalent to the conjunction of $\lambda(\mathcal{X}_1) > 0$ and $\lambda(\mathcal{X}_2) > 0$.
Thus, Lemma 3.1 can be further strengthened in this case to guarantee that almost
every realization of the HMM has infinitely many both 1- and 2-barriers. Alternatively,
assumption (3.11) can be relaxed to (3.12) in this, as well as in many other practical
situations, for Lemma 3.1 to still guarantee at least one type of barrier.

Next, consider case 2. Lemma 3.1 says that when both sets

$$\mathcal{X}_1 := \{x \in \mathcal{X} : f_1(x) p_{21} > f_2(x) p_{12}\}, \qquad \mathcal{X}_2 := \{x \in \mathcal{X} : f_1(x) p_{21} < f_2(x) p_{12}\} \tag{3.13}$$

have positive $\lambda$-measure, then almost every realization $x_{1:\infty}$ includes infinitely many
barriers. One can show that these barriers are the elements of the set $B = \mathcal{X}_1 \times \mathcal{X}_2 \times
\mathcal{X}_1 \times \cdots \times \mathcal{X}_2$. Indeed, it can be shown that the absence of nodes in a generic subsequence
$x_{t:t+T}$ would imply optimality of the likelihood motif $p_{ba} f_a(x_t) p_{ab} f_b(x_{t+1})$, $a,b \in S$, $a \ne b$.
However, if $x_{t:t+T} \in \mathcal{X}_b \times \mathcal{X}_a \times \mathcal{X}_b \times \cdots$ and $T$ is sufficiently large, then this motif will no
longer be optimal, hence there is a node inside $x_{t:t+T}$. In [28], we additionally show that barriers
(or nodes) also exist in case 2, even if only one of the sets in (3.13) has positive measure.
Since a typical Viterbi path in case 2 oscillates between the states (as also acknowledged
in [5]), case 2 is not similar to case 1, requiring a different approach to prove the existence
of barriers (or nodes) under the weakened assumption $\max\{\lambda(\mathcal{X}_1), \lambda(\mathcal{X}_2)\} > 0$. This also
explains why we generally ($K > 2$) require (3.11) to hold for more than one state. In [5],
the author reports similar results, Theorems 10 and 11, without proofs, alleging that the
omitted proofs are "very similar" to the respective proofs of Theorems 7 and 8 of [5]. We
are convinced that proving Theorem 10 of [5] requires an approach different from that of
the proof of Theorem 7 in [5].

Finally, case 3 is the mixture model with weights $\pi_1 = p_{11} = p_{21}$, $\pi_2 = p_{22} = p_{12}$. Every
observation is now a node (Example 2). Again, if $\lambda(\{f_1 \ne f_2\}) > 0$ holds, then so does
(3.12), say, with the first of its statements. Every element of $\{x \in \mathcal{X} : f_1(x)\pi_1 > f_2(x)\pi_2\}$
is then a strong 1-barrier of order 0 and length 1. Therefore, unlike in Theorems 12,
13 and 14 of [5], the existence of infinitely many barriers (nodes) again follows with no
additional assumptions.

In summary, barriers allow us to prove, relatively easily, the existence of infinitely 
many nodes. Although the existence of barriers is rather obvious for K = 2, the CLT- 
based proof of [7], Theorem 2, does not apply if K > 2, necessitating generalizations such 
as Lemma 3.1. 

For certain technical reasons, instead of extracting subsequences of separated nodes
from general infinite sequences of nodes guaranteed by Lemma 3.1, we achieve node
separation by adjusting the notion of barriers. Namely, note that two $r$th order $l$-barriers
$x_{j:j+M-1}$ and $x_{i:i+M-1}$ might be in $B$ with $j < i \le j + r$, implying that the associated
nodes $x_{j+M-r-1}$ and $x_{i+M-r-1}$ are not separated. Thus, we impose on $B$ the following
condition:

$$x_{j:j+M-1},\ x_{i:i+M-1} \in B,\ i \ne j \ \Rightarrow\ |i - j| > r. \tag{3.14}$$

If (3.14) holds, then we say that the barriers from $B \subset \mathcal{X}^M$ are separated. This is often
easy to achieve by a simple extension of $B$, as shown in the following example. Suppose
that there exists $x \in \mathcal{X}$ such that $x \notin B_m$ for all $m = 1, 2, \ldots, M$. All elements of
$B^* := \{x\} \times B$ are evidently barriers and, moreover, they are now separated. The following
lemma incorporates a more general version of the above example (see [26], Appendix 5.2,
pages 39-40, for proof).
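The separation effect of prepending a symbol $x \notin B_m$ can be checked on a toy discrete alphabet. The sketch below only locates occurrences of a product set in a sequence and tests condition (3.14); it does not verify the barrier property itself. The function names `occurrences` and `separated`, the representation of $B$ as a list of Python sets, and the alphabet in the usage note are assumptions made for illustration.

```python
def occurrences(xs, B):
    """Start positions j (0-based) with xs[j:j+M] in B = B[0] x ... x B[M-1],
    where each B[m] is a set of symbols and M = len(B)."""
    M = len(B)
    return [j for j in range(len(xs) - M + 1)
            if all(xs[j + m] in B[m] for m in range(M))]

def separated(positions, r):
    """Condition (3.14): distinct occurrences are more than r positions apart."""
    return all(b - a > r for a, b in zip(positions, positions[1:]))
```

For example, with $B = \{a\} \times \{a, b\}$ the sequence `aaab` contains overlapping occurrences at positions 0, 1 and 2, which are not separated for $r = 1$. After prepending the symbol `c`, which belongs to neither factor of $B$, occurrences of $\{c\} \times B$ cannot overlap, so they are separated for any $r \le M$.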

Lemma 3.2. Suppose that the assumptions of Lemma 3.1 are satisfied. Then, for some 
integers $M$ and $r$, $M > r \ge 0$, there exist $B = B_1 \times \cdots \times B_M \subset \mathcal{X}^M$, $q_{1:M} \in S^M$ and $l \in S$ 
such that every $x_{1:M} \in B$ is a separated $l$-barrier of order $r$ (and length $M$), $q_{M-r} = l$, 
$P(X_{1:M} \in B \mid Y_{1:M} = q_{1:M}) > 0$ and $P(Y_{1:M} = q_{1:M}) > 0$. 

4. The alignment process 

For the rest of this work, we adopt the assumptions of Lemma 3.2 to guarantee that 
almost every realization of the HMM has infinitely many separated barriers. Every such 
barrier contains a node. Note that both the barrier and the node encapsulated in it 
are therefore observable via testing the running $M$-tuples of $X_{1:\infty}$ for membership in 
$B$. Based on such nodes, we define $v : \mathcal{X}^\infty \to S^\infty$ to be a proper decoding by piecewise 
alignment (3.9) (and $v(x_{1:\infty})_i = 1$, $i \ge 1$, for $x_{1:\infty}$ that do not have infinitely many 
$B$-barriers). Next, we study properties of the random alignment process $V_{1:\infty} = v(X_{1:\infty})$. 
Let $M > 0$, $B \subset \mathcal{X}^M$, $r \ge 0$, $l \in S$ and $q = q_{1:M} \in S^M$ be as promised by Lemma 3.2. For 
every $n \ge 1$, $P(Y_{n:n+M-1} = q) > 0$ and $\gamma^* \stackrel{\mathrm{def}}{=} P(X_{n:n+M-1} \in B \mid Y_{n:n+M-1} = q) > 0$, and 
every $x_{n:n+M-1} \in B$ is a separated $l$-barrier of order $r$. Next, define, for all $n \ge 1$, 

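$$U_n \stackrel{\mathrm{def}}{=} X_{n:n+M-1}, \qquad D_n \stackrel{\mathrm{def}}{=} Y_{n:n+M-1}, \qquad \mathcal{F}_n \stackrel{\mathrm{def}}{=} \sigma(Y_{1:n}, X_{1:n}),$$

as well as stopping times $\nu_0, \nu_1, \nu_2, \ldots$ and $\vartheta_0, \vartheta_1, \vartheta_2, \ldots$ of the filtration $\{\mathcal{F}_{n+M-1}\}_{n \ge 1}$: 

$$\begin{aligned}
\nu_0 &\stackrel{\mathrm{def}}{=} \min\{n \ge 1 : U_n \in B,\ D_n = q\}, \\
\nu_i &\stackrel{\mathrm{def}}{=} \min\{n > \nu_{i-1} : U_n \in B,\ D_n = q\} \quad \forall i \ge 1, \\
\vartheta_0 &\stackrel{\mathrm{def}}{=} \min\{n \ge 1 : U_n \in B\}, \\
\vartheta_i &\stackrel{\mathrm{def}}{=} \min\{n > \vartheta_{i-1} : U_n \in B\} \quad \forall i \ge 1,
\end{aligned}$$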



with the convention that $\min\emptyset = \infty$ and $\max\emptyset = -1$. Note that $\vartheta_i \le \nu_i$, $i \ge 0$. Stopping 
times $\vartheta_i$ ($i \ge 0$) are observable via the $X$ process alone, whereas stopping times $\nu_i$ ($i \ge 0$) 
already require knowledge of the full process $(X_{1:\infty}, Y_{1:\infty})$. Also, note that $\nu_0$, $(\nu_{i+1} - \nu_i)$, 
$i \ge 0$, are independent and $(\nu_{i+1} - \nu_i)$, $i \ge 0$, are identically distributed. To every $\nu_i$, there 
corresponds an $l$-barrier of order $r$. This barrier extends over the interval $[\nu_i, \nu_i + M - 1]$ 
and $X_{\tau_i}$ is an $l$-node of order $r$, where $\tau_i \stackrel{\mathrm{def}}{=} \nu_i + (M - 1) - r$ for every $i \ge 0$. Define 
$T_0 \stackrel{\mathrm{def}}{=} \tau_0$ and $T_i \stackrel{\mathrm{def}}{=} \tau_i - \tau_{i-1} = \nu_i - \nu_{i-1}$ for every $i \ge 1$. 
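
The observation that barrier times can be read off the $X$ process alone, by testing running $M$-tuples for membership in $B$, can be illustrated with a short sketch. It is only an illustration under assumptions that are not part of the text: the coordinate sets $B_1, \ldots, B_M$ are represented by hypothetical Python predicates, time indices are 1-based as in the paper, and the function names `barrier_times` and `node_locations` are ours.

```python
from typing import Callable, List, Sequence


def barrier_times(x: Sequence[float],
                  B: Sequence[Callable[[float], bool]]) -> List[int]:
    """Return the 1-based times n with x_{n:n+M-1} in B_1 x ... x B_M."""
    M = len(B)
    times = []
    for n in range(1, len(x) - M + 2):      # only windows that fit inside x
        window = x[n - 1:n - 1 + M]
        if all(in_Bm(xm) for in_Bm, xm in zip(B, window)):
            times.append(n)
    return times


def node_locations(times: Sequence[int], M: int, r: int) -> List[int]:
    """Map each barrier start time t to the node location t + (M - 1) - r."""
    return [t + (M - 1) - r for t in times]


# Hypothetical example with M = 3 and r = 1. The first coordinate set is
# disjoint from the other two, so two barriers cannot overlap.
B = [lambda v: v < 0, lambda v: v > 1, lambda v: v > 1]
x = [0.5, -1.0, 2.0, 3.0, 0.2, -0.3, 1.5, 1.2, 0.0]
thetas = barrier_times(x, B)
nodes = node_locations(thetas, M=3, r=1)
```

In this toy input the scan uses the observations only; no knowledge of the hidden states is needed, which is the sense in which such times are observable.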

Proposition 4.1. $E(T_0) < \infty$ and $E(T_1) < \infty$. 

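Proof. We need to show that $E\nu_0 < \infty$ and $E(\nu_1 - \nu_0) < \infty$. Let us introduce 
the following non-overlapping block-valued processes $U^b$ and $D^b$, defined by 
$U^b_m \stackrel{\mathrm{def}}{=} X_{(m-1)M+1:mM}$, $D^b_m \stackrel{\mathrm{def}}{=} Y_{(m-1)M+1:mM}$, for all $m \ge 1$, and stopping times defined, for 
every $i \ge 1$, by 

$$\begin{aligned}
\nu^b_0 &\stackrel{\mathrm{def}}{=} \min\{m \ge 1 : U^b_m \in B,\ D^b_m = q\}, \\
\nu^b_i &\stackrel{\mathrm{def}}{=} \min\{m > \nu^b_{i-1} : U^b_m \in B,\ D^b_m = q\},
\end{aligned} \tag{4.1}$$

$$\begin{aligned}
R^b_0 &\stackrel{\mathrm{def}}{=} \min\{m \ge 1 : D^b_m = q\}, \\
R^b_i &\stackrel{\mathrm{def}}{=} \min\{m > R^b_{i-1} : D^b_m = q\}.
\end{aligned} \tag{4.2}$$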

The process $D^b$ is clearly a time-homogeneous, finite-state Markov chain. Since $Y_{1:\infty}$ is 
aperiodic and irreducible, so is $D^b$. Hence, $(D^b, U^b)$ is also an HMM. 

Since $Y_{1:\infty}$ is also stationary (under $\pi$), $q$ occurs in every interval of length $M$ with 
the same positive probability (Lemma 3.2). In particular, $q$ belongs to the state space 
of $D^b$. Since $D^b$ is irreducible and its state space is finite, all of its states, including $q$, 
are positive recurrent. Hence, $E(R^b_0) < \infty$ and $E(R^b_1 - R^b_0) < \infty$. The following bound 
ultimately yields the second statement: $E(\nu_1 - \nu_0) \le M\,E(\nu^b_1 - \nu^b_0) = \frac{M}{\gamma^*}E(R^b_1 - R^b_0) < \infty$. 
This bound is obtained by twice applying Wald's equation [3]. 
It can similarly be verified that $E\nu^b_0$ is a linear combination of $E(R^b_0)$ and $E(R^b_1 - R^b_0)$ 
with finite coefficients depending only on $\gamma^*$, which is again 
finite. Finally, $E\nu_0 \le M(E\nu^b_0 - 1) + 1 < \infty$. □ 
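
The role of Wald's equation in the proof can be made explicit as follows. This is a sketch under our reading of the construction, not a reproduction of the authors' argument; the symbol $N$ is our notation. Let $N$ be the number of visits of $D^b$ to $q$ strictly after $\nu^b_0$ and up to and including $\nu^b_1$. Given the hidden chain, emissions in different blocks are conditionally independent, and each visit to $q$ produces a block in $B$ with probability $\gamma^*$. Hence $N$ is geometric with success probability $\gamma^*$ and independent of the i.i.d. return times of $D^b$ to $q$, so that

```latex
\begin{align*}
P(N = k) &= (1-\gamma^*)^{k-1}\gamma^*, \qquad k \ge 1, \\
E(\nu^b_1 - \nu^b_0) &= E(N)\,E(R^b_1 - R^b_0) = \frac{1}{\gamma^*}\,E(R^b_1 - R^b_0).
\end{align*}
```

Positive recurrence of $q$ gives $E(R^b_1 - R^b_0) < \infty$, and $\gamma^* > 0$ then makes the right-hand side finite.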

According to Proposition 4.1 above, $ET_i < \infty$ for every $i \ge 0$, implying that the random 
variables $\tau_0, \tau_1, \ldots$ form a delayed renewal process (for a general reference, see, e.g., [3]). 
In [5], the process $\tau$ and the expectation $ET_1$ are denoted by $S$ and $E(S_1|S_0)$, respectively. 
As the proof of Proposition 4.1 above shows, using the barriers, it is relatively easy to 
prove that $ET_1 < \infty$. On the other hand, without such a unifying concept, [5] must prove 
$E(S_1|S_0) < \infty$ separately for every case considered therein. 

Next, let $u_0, u_1, \ldots$ be the locations of $r$th-order $l$-nodes corresponding to the stopping 
times $\vartheta_i$, that is, $u_i \stackrel{\mathrm{def}}{=} \vartheta_i + (M - 1) - r$ for every $i \ge 0$. Clearly, for every $i \ge 0$, $\tau_i = u_j$ 
for some $j \ge i$. Also, since the barriers are separated, so are $(u_i)_{i \ge 0}$. Using these nodes, 
we build the alignment $v$ and thus extend the definitions of the empirical measures 
$\hat P^n_l(\psi, X_{1:n})$ given in (2.3) and the estimators of transition probabilities $\hat p^n_{lj}$ given in (2.2) 
to the general case of non-unique alignments. Specifically, given $X_{1:n}$, define 
$V'_{1:n} \stackrel{\mathrm{def}}{=} v(X_{1:n})$ to be the (finite) piecewise proper alignment based on the $u_i$'s (and a consistent 
selection scheme) in accordance with (3.9). For each state $l \in S$ that appears in $V'_{1:n}$, 
define 

$$\hat P^n_l(A; \psi, X_{1:n}) \stackrel{\mathrm{def}}{=} \frac{\sum_{i=1}^{n} I_{A \times \{l\}}(X_i, V'_i)}{\sum_{i=1}^{n} I_{\{l\}}(V'_i)}, \qquad A \in \mathcal{B}.$$

For other $l \in S$ (i.e., $\sum_{i=1}^{n} I_{\{l\}}(V'_i) = 0$), define $\hat P^n_l(\psi, X_{1:n})$ to be an arbitrary probability 
measure. 

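Similarly, for every pair of states $l, j \in S$, we define 

$$\hat p^n_{lj}(\psi, X_{1:n}) \stackrel{\mathrm{def}}{=} \frac{\sum_{i=1}^{n-1} I_{\{l\}}(V'_i)\, I_{\{j\}}(V'_{i+1})}{\sum_{i=1}^{n-1} I_{\{l\}}(V'_i)}.$$

Again, if $\sum_{i=1}^{n-1} I_{\{l\}}(V'_i) = 0$, define $(\hat p^n_{lj}(\psi, X_{1:n}))_{j \in S}$ to be an arbitrary probability vector on 
$S$. 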
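
As a concrete illustration of these alignment-based estimators, the sketch below computes transition frequencies and the per-state observation samples from a given finite alignment. It is a minimal example under our own assumptions: the alignment is supplied as a plain list (how it is obtained is not modelled), the "arbitrary" choice for states with no outgoing transition is taken to be the uniform vector, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def transition_estimates(v: List[Hashable],
                         states: Sequence[Hashable]) -> Dict:
    """p_hat[l][j] = #{i < n : v_i = l, v_{i+1} = j} / #{i < n : v_i = l}."""
    pair_counts = Counter(zip(v[:-1], v[1:]))   # counts of (l, j) transitions
    from_counts = Counter(v[:-1])               # visits to l among i = 1..n-1
    p_hat = {}
    for l in states:
        if from_counts[l] == 0:
            # Arbitrary probability vector (our assumption: uniform).
            p_hat[l] = {j: 1.0 / len(states) for j in states}
        else:
            p_hat[l] = {j: pair_counts[(l, j)] / from_counts[l] for j in states}
    return p_hat


def emission_sample(x: Sequence[float], v: Sequence[Hashable],
                    l: Hashable) -> List[float]:
    """Observations aligned with state l; their empirical law is P_hat_l."""
    return [xi for xi, vi in zip(x, v) if vi == l]


v = [1, 1, 2, 1, 2, 2]
x = [0.1, 0.3, 2.0, 0.2, 1.8, 2.2]
p_hat = transition_estimates(v, states=[1, 2, 3])
sample_1 = emission_sample(x, v, 1)
```

Each row of `p_hat` sums to one, including the row of the unvisited state 3, which mirrors the convention of assigning an arbitrary probability vector when the denominator vanishes.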

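We shall next consider the 2-dimensional process $Z \stackrel{\mathrm{def}}{=} (X_{1:\infty}, V_{1:\infty})$. Based on $Z$, for 
every $l \in S$, we also define auxiliary empirical measures $\hat Q^n_l$ and $(\hat q^n_{lj})_{j \in S}$ as follows: 

$$\hat Q^n_l(A; Z_{1:n}) \stackrel{\mathrm{def}}{=} \frac{\sum_{i=1}^{n} I_{A \times \{l\}}(X_i, V_i)}{\sum_{i=1}^{n} I_{\{l\}}(V_i)} = \frac{\sum_{i=1}^{n} I_{A \times \{l\}}(Z_i)}{\sum_{i=1}^{n} I_{\{l\}}(V_i)},$$

$$\hat q^n_{lj}(Z_{1:n}) \stackrel{\mathrm{def}}{=} \frac{\sum_{i=1}^{n-1} I_{\{l\}}(V_i)\, I_{\{j\}}(V_{i+1})}{\sum_{i=1}^{n-1} I_{\{l\}}(V_i)} \quad \text{for every } j \in S.$$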

As in the definition of $\hat P^n_l(\psi, X_{1:n})$, if $l \ne V_i$, $i = 1, \ldots, n$ ($i = 1, \ldots, n - 1$), then the $\hat Q^n_l(Z_{1:n})$'s 
($\hat q^n_{lj}(Z_{1:n})$'s) are defined arbitrarily. Note that, in general, $v(x_{1:\infty})_{1:n} \ne v(x_{1:n})$. However, 
the two are equal up to the last node occurring prior to $n$ and used in the construction 
of $v$. Thus, after that last node, $V'_i$ need no longer agree with $V_i$. 

To prove the existence of $Q_l$ such that $\hat P^n_l(\psi, X_{1:n}) \Rightarrow Q_l(\psi)$ a.s., we first note that $Z$ 
is a regenerative process [3] with respect to the renewal times $(\tau_i)_{i \ge 0}$. This implies that 
$\hat Q^n_l(Z_{1:n}) \Rightarrow Q_l(\psi)$, a.s. Finally, since the difference between $\hat Q^n_l(Z_{1:n})$ and $\hat P^n_l(\psi, X_{1:n})$ 
vanishes as $n \to \infty$, we have $\hat P^n_l(\psi, X_{1:n}) \Rightarrow Q_l(\psi)$ almost surely. Similarly, we prove the 
almost sure convergence $\hat p^n_{lj}(\psi, X_{1:n}) \to q_{lj}(\psi)$. 

The fact that the process $Z$ is regenerative is crucial and is the main result in [5], 
Theorem 2. That $X$ is regenerative immediately follows from the fact that for every 
$i \ge 0$, $Y_{\tau_i} = l$ and the $\tau_i$'s are renewal times. $V$ is regenerative because all the nodes 
occurring at $\tau_i$'s are used in the construction of $V_{1:\infty}$ via (3.9) and because decoding 
$V_{1:\infty}$ is proper. That is, for every $i \ge 1$, $V_{\tau_{i-1}+1:\tau_i} = w^j \in \mathcal{W}^l_{(l)}(X_{\tau_{i-1}+1:\tau_i})$ for some $j \ge i$. 
Hence, for every $i \ge 1$, the alignments up to $\tau_i$ and after $\tau_i$ are independent and $V_{\tau_i+1:\infty}$ 
agrees with $V_{\tau_1+1:\infty}$ in distribution. Regenerativity of $Z$ with respect to $(\tau_i)_{i \ge 0}$ follows 
straightforwardly and we refer to the formal proof of [5], Theorem 2, for details. 

Theorem 4.1. If $X$ satisfies the assumptions of Lemma 3.1, then there exist probability 
measures $Q_l(\psi)$, $l \in S$, such that $\hat Q^n_l(\psi, X_{1:n}) \Rightarrow Q_l(\psi)$ and $\hat P^n_l(\psi, X_{1:n}) \Rightarrow Q_l(\psi)$ as $n \to \infty$, 
almost surely. 

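Proof. The proof below uses regenerativity of $Z$ in a standard way. For every $n \ge \tau_0$, 
$A \in \mathcal{B}$ and $l \in S$, we have 

$$\frac{1}{n}\sum_{i=1}^{n} I_{A \times \{l\}}(Z_i) = \frac{1}{n}\sum_{i=1}^{\tau_0} I_{A \times \{l\}}(Z_i) + \frac{1}{n}\sum_{i=\tau_0+1}^{\tau_{k(n)}} I_{A \times \{l\}}(Z_i) + \frac{1}{n}\sum_{i=\tau_{k(n)}+1}^{n} I_{A \times \{l\}}(Z_i), \tag{4.3}$$

where $k(n) \stackrel{\mathrm{def}}{=} \max\{k : \tau_k \le n\}$ is also a renewal process. Now, since $\tau_0 < \infty$ a.s., we have 

$$\frac{1}{n}\sum_{i=1}^{\tau_0} I_{A \times \{l\}}(Z_i) \le \frac{\tau_0}{n} \xrightarrow[n \to \infty]{} 0 \quad \text{a.s.}$$

Let $\mathcal{M} \stackrel{\mathrm{def}}{=} ET_1$, which is finite by Proposition 4.1. Then, $(n - \tau_{k(n)})/n \le T_{k(n)+1}/n \to 0$, 
a.s. Finally, since $Z$ is regenerative with respect to $\tau_0, \tau_1, \ldots$, we have 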

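$$\frac{1}{n}\sum_{i=\tau_0+1}^{\tau_{k(n)}} I_{A \times \{l\}}(Z_i) = \frac{k(n)}{n}\,\frac{1}{k(n)}\sum_{k=1}^{k(n)} \xi_k, \qquad \text{where } \xi_k \stackrel{\mathrm{def}}{=} \sum_{i=\tau_{k-1}+1}^{\tau_k} I_{A \times \{l\}}(Z_i),\ k = 1, 2, \ldots,$$

are i.i.d. Let $m_l(A;\psi) \stackrel{\mathrm{def}}{=} E\xi_1$. Since $m_l(A;\psi) \le \mathcal{M} < \infty$, it holds that, as $n \to \infty$, 

$$\frac{k(n)}{n} \to \frac{1}{\mathcal{M}} \quad \text{and} \quad \frac{1}{k(n)}\sum_{k=1}^{k(n)} \xi_k \to m_l(A;\psi) \quad \text{a.s.},$$

implying that (4.3) tends to $m_l(A;\psi)/\mathcal{M}$ a.s. Similarly, 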

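$$\frac{1}{n}\sum_{i=1}^{n} I_{\{l\}}(V_i) \xrightarrow[n \to \infty]{} \frac{w_l(\psi)}{\mathcal{M}} \quad \text{a.s.}, \qquad \text{where } w_l(\psi) \stackrel{\mathrm{def}}{=} E\Bigl(\sum_{i=\tau_{k-1}+1}^{\tau_k} I_{\{l\}}(V_i)\Bigr).$$

Hence, we have shown that for each $l \in S$ and every $A \in \mathcal{B}$, 

$$\hat Q^n_l(A; Z_{1:n}) \xrightarrow[n \to \infty]{} Q_l(A;\psi) \quad \text{a.s.}, \qquad \text{where } Q_l(A;\psi) \stackrel{\mathrm{def}}{=} \frac{m_l(A;\psi)}{w_l(\psi)}.$$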

It is easy to note that $A \mapsto m_l(A;\psi)$ is a measure and that $m_l(\mathcal{X};\psi) = w_l(\psi)$. Hence, 
every $Q_l(\psi)$ ($l \in S$) is a probability measure. Recalling that $\mathcal{X}$ is a separable metric 
space and invoking the theory of weak convergence of measures now establishes that 
$\hat Q^n_l(Z_{1:n}) \Rightarrow Q_l(\psi)$ almost surely as $n \to \infty$. It remains to show that for all $l \in S$ and $A \in \mathcal{B}$, 

$$\hat P^n_l(A;\psi,X_{1:n}) \xrightarrow[n \to \infty]{} Q_l(A;\psi) \quad \text{a.s.} \tag{4.4}$$

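To see this, consider $\sum_{i=1}^{n} I_{A \times \{l\}}(X_i, V'_i)$. Since $V'_i = V_i$ for $i \le \tau_{k(n)}$, we obtain 

$$\frac{1}{n}\sum_{i=1}^{n} I_{A \times \{l\}}(X_i, V'_i) \xrightarrow[n \to \infty]{} \frac{m_l(A;\psi)}{\mathcal{M}} \quad \text{a.s.}$$

Similarly, $\frac{1}{n}\sum_{i=1}^{n} I_{\{l\}}(V'_i) \to w_l(\psi)/\mathcal{M}$ almost surely. □ 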
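
The limit of the form (expected reward per cycle) / (expected cycle length) obtained in the proof is an instance of the renewal-reward law of large numbers. The toy simulation below illustrates that mechanism only; it does not simulate an HMM or a Viterbi alignment. All specifics are our assumptions: cycle lengths are i.i.d. uniform on {1, 2, 3} (mean 2), the reward collected in a cycle of length T is min(T, 2) (mean 5/3), and the function name is hypothetical.

```python
import random


def renewal_reward_average(n: int, seed: int = 0) -> float:
    """Return (reward collected in complete cycles up to time n) / n."""
    rng = random.Random(seed)
    t = 0          # time at which the last complete cycle ended
    reward = 0     # reward accumulated over complete cycles
    while True:
        length = rng.choice([1, 2, 3])    # i.i.d. cycle length, mean 2
        if t + length > n:
            break                          # drop the incomplete last cycle
        t += length
        reward += min(length, 2)           # toy per-cycle reward, mean 5/3
    return reward / n


average = renewal_reward_average(200_000)
limit = (5 / 3) / 2                        # mean reward / mean cycle length
```

For large `n` the time average is close to `limit`; the contribution of the incomplete final cycle is at most 3/n here, which corresponds to the vanishing boundary terms in (4.3).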

Corollary 4.1. If $X_{1:\infty}$ satisfies the assumptions of Lemma 3.1, then, for every $l \in S$, 
there exists a probability measure $q_{l1}, \ldots, q_{lK}$ on $S$ such that $\hat p^n_{lj}(\psi; X_{1:n}) \to q_{lj}(\psi)$ and 
$\hat q^n_{lj}(Z_{1:n}) \to q_{lj}(\psi)$ almost surely as $n \to \infty$. 

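Proof. The proof is the same as that of Theorem 4.1, with the decomposition (4.3) applied to the 
transition counts: 

$$\frac{1}{n}\sum_{i=1}^{n-1} I_{\{l\}}(V_i)\, I_{\{j\}}(V_{i+1}) = \frac{1}{n}\sum_{i=1}^{\tau_0} I_{\{l\}}(V_i)\, I_{\{j\}}(V_{i+1}) + \frac{1}{n}\sum_{i=\tau_0+1}^{\tau_{k(n)}} I_{\{l\}}(V_i)\, I_{\{j\}}(V_{i+1}) + \frac{1}{n}\sum_{i=\tau_{k(n)}+1}^{n-1} I_{\{l\}}(V_i)\, I_{\{j\}}(V_{i+1}),$$

which converges almost surely as $n \to \infty$ by the same argument as for (4.3). 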
□ 

We have proposed, in [27], [24] and in this work, to improve the precision of the VT 
estimation by enabling the estimation algorithm to asymptotically confirm the true 
parameters. In this work, we have developed the central theoretical component of the above 
methodology. Namely, we have constructed a suitable infinite Viterbi decoding process 
and have used it to prove the existence of the limiting distributions responsible for the 
'fixed point bias' in a very general class of HMMs. General approaches to the efficient 
computing of the correction functions have been recently proposed in [24]. Model-specific 
implementations of these approaches are a subject of the authors' continuing investigation. 